The New York Times is suing OpenAI for training GPT-4 on NYT articles. The Authors Guild is suing over training on copyrighted books. Getty Images sued Stability AI for training Stable Diffusion on watermarked photos. These cases will determine whether scraping copyrighted text for LLM training is fair use or a source of billion-dollar liability. As of March 2026, no definitive ruling exists: the law has not caught up to the technology, and every frontier lab operates in a legal gray area.
This post covers the legal landscape as it affects engineering decisions: what the law says, how courts have ruled, what compliance requires technically, and how to build data pipelines that respect opt-out signals and maintain provenance records.
Copyright Fundamentals for Training Data
The Legal Framework
```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Jurisdiction(Enum):
    US = "united_states"
    EU = "european_union"
    UK = "united_kingdom"
    JAPAN = "japan"
    CHINA = "china"
    CANADA = "canada"


class CopyrightStatus(Enum):
    PUBLIC_DOMAIN = "public_domain"
    CREATIVE_COMMONS = "creative_commons"
    COPYRIGHTED_FAIR_USE_LIKELY = "copyrighted_fair_use_likely"
    COPYRIGHTED_FAIR_USE_UNCERTAIN = "copyrighted_fair_use_uncertain"
    COPYRIGHTED_RESTRICTED = "copyrighted_restricted"
    OPT_OUT = "opt_out"
    UNKNOWN = "unknown"


@dataclass
class FairUseFactor:
    """One of the four fair use factors (US law)."""
    factor_number: int
    name: str
    analysis: str
    favors: str  # "plaintiff", "defendant", or "neutral"
    weight: float


@dataclass
class CopyrightAnalysis:
    """Copyright analysis for a data source."""
    source: str
    jurisdiction: Jurisdiction
    status: CopyrightStatus
    fair_use_factors: list = field(default_factory=list)
    opt_out_signal: Optional[str] = None
    license: Optional[str] = None
    risk_level: str = "unknown"
    notes: str = ""
```
```python
class FairUseAnalyzer:
    """
    Analyze fair use factors for training data.

    US copyright law (17 U.S.C. § 107) defines four factors:

    1. Purpose and character of the use
       (commercial vs educational, transformative vs copying)
    2. Nature of the copyrighted work
       (factual vs creative, published vs unpublished)
    3. Amount used relative to the whole
       (small excerpt vs entire work)
    4. Effect on the market for the original
       (does the AI output substitute for the original?)

    No single factor is determinative. Courts weigh
    all four together.
    """

    def analyze_training_use(self, source_type, use_context):
        """Analyze fair use for a specific training data source."""
        factors = []

        # Factor 1: Purpose and character
        factors.append(self._analyze_factor_1(use_context))

        # Factor 2: Nature of the work
        factors.append(self._analyze_factor_2(source_type))

        # Factor 3: Amount used
        factors.append(self._analyze_factor_3(source_type, use_context))

        # Factor 4: Market effect
        factors.append(self._analyze_factor_4(source_type, use_context))

        # Overall assessment: count factors favoring the defendant
        defendant_factors = sum(
            1 for f in factors if f.favors == "defendant"
        )

        if defendant_factors >= 3:
            risk = "low"
        elif defendant_factors >= 2:
            risk = "medium"
        else:
            risk = "high"

        return CopyrightAnalysis(
            source=source_type,
            jurisdiction=Jurisdiction.US,
            status=(
                CopyrightStatus.COPYRIGHTED_FAIR_USE_LIKELY
                if risk == "low"
                else CopyrightStatus.COPYRIGHTED_FAIR_USE_UNCERTAIN
            ),
            fair_use_factors=factors,
            risk_level=risk,
        )

    def _analyze_factor_1(self, use_context):
        """
        Factor 1: Purpose and character of the use.

        Key question: is the use transformative?
        Training a model that generates new text from
        learned patterns is likely transformative.
        The output is not a copy of the input.
        However: commercial use weighs against fair use.
        """
        is_commercial = use_context.get("commercial", True)
        is_transformative = use_context.get("transformative", True)

        if is_transformative and not is_commercial:
            favors = "defendant"
            analysis = (
                "Transformative non-commercial use. "
                "Model learns patterns, does not reproduce "
                "originals. Strongly favors fair use."
            )
        elif is_transformative and is_commercial:
            favors = "neutral"
            analysis = (
                "Transformative but commercial. The "
                "transformative nature of training weighs "
                "for defendant, but commercial purpose "
                "weighs for plaintiff. Net: neutral."
            )
        else:
            favors = "plaintiff"
            analysis = (
                "Non-transformative commercial use. "
                "Direct copying for commercial gain."
            )

        return FairUseFactor(
            factor_number=1,
            name="Purpose and character of the use",
            analysis=analysis,
            favors=favors,
            weight=0.35,
        )

    def _analyze_factor_2(self, source_type):
        """
        Factor 2: Nature of the copyrighted work.

        Factual works get less protection than creative works.
        Published works get less protection than unpublished.
        """
        creative_sources = {
            "fiction", "poetry", "music_lyrics",
            "screenplays", "novels",
        }
        factual_sources = {
            "news", "wikipedia", "scientific_papers",
            "government_reports", "legal_documents",
        }

        if source_type in factual_sources:
            return FairUseFactor(
                factor_number=2,
                name="Nature of the copyrighted work",
                analysis=(
                    f"Source type '{source_type}' is primarily "
                    f"factual. Factual works receive thinner "
                    f"copyright protection."
                ),
                favors="defendant",
                weight=0.15,
            )
        elif source_type in creative_sources:
            return FairUseFactor(
                factor_number=2,
                name="Nature of the copyrighted work",
                analysis=(
                    f"Source type '{source_type}' is creative. "
                    f"Creative works receive strong copyright "
                    f"protection."
                ),
                favors="plaintiff",
                weight=0.15,
            )
        else:
            return FairUseFactor(
                factor_number=2,
                name="Nature of the copyrighted work",
                analysis="Mixed factual/creative content.",
                favors="neutral",
                weight=0.15,
            )

    def _analyze_factor_3(self, source_type, use_context):
        """
        Factor 3: Amount and substantiality used.

        Training on full documents weighs against fair use.
        Training on snippets or abstracts weighs for it.
        """
        amount = use_context.get("amount", "full")

        if amount == "snippet":
            return FairUseFactor(
                factor_number=3,
                name="Amount and substantiality",
                analysis="Only snippets or abstracts used.",
                favors="defendant",
                weight=0.20,
            )
        elif amount == "full":
            return FairUseFactor(
                factor_number=3,
                name="Amount and substantiality",
                analysis=(
                    "Entire works used for training. "
                    "Weighs against fair use."
                ),
                favors="plaintiff",
                weight=0.20,
            )
        else:
            return FairUseFactor(
                factor_number=3,
                name="Amount and substantiality",
                analysis="Partial use.",
                favors="neutral",
                weight=0.20,
            )

    def _analyze_factor_4(self, source_type, use_context):
        """
        Factor 4: Effect on the market.

        Does the AI model substitute for the original?
        A chatbot that summarizes NYT articles reduces
        demand for NYT subscriptions (market harm).
        A code assistant trained on GitHub does not
        substitute for the original repositories.
        """
        market_substitute = use_context.get("market_substitute", False)

        if market_substitute:
            return FairUseFactor(
                factor_number=4,
                name="Effect on the market",
                analysis=(
                    "Model output can substitute for original "
                    "works, reducing demand for the original. "
                    "Strongest factor against fair use."
                ),
                favors="plaintiff",
                weight=0.30,
            )
        else:
            return FairUseFactor(
                factor_number=4,
                name="Effect on the market",
                analysis=(
                    "Model output does not directly substitute "
                    "for the original works. Different market."
                ),
                favors="defendant",
                weight=0.30,
            )
```
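The analyzer's overall assessment counts factors equally, even though each factor carries a weight. A weighted aggregate is one natural refinement; this is a hypothetical helper (not part of the analyzer above), using the same weights (0.35, 0.15, 0.20, 0.30):

```python
# Hypothetical helper: aggregate the four fair use factors by weight
# instead of a simple favors-count. Positive scores lean toward fair use.
def weighted_fair_use_score(factors):
    """Return a score in [-1, 1] from (favors, weight) pairs."""
    sign = {"defendant": 1.0, "plaintiff": -1.0, "neutral": 0.0}
    total = sum(w for _, w in factors) or 1.0
    return sum(sign[f] * w for f, w in factors) / total

# Transformative commercial use of factual content, full works,
# no market substitution (weights mirror the analyzer above):
score = weighted_fair_use_score([
    ("neutral", 0.35),    # factor 1: transformative but commercial
    ("defendant", 0.15),  # factor 2: factual work
    ("plaintiff", 0.20),  # factor 3: entire works used
    ("defendant", 0.30),  # factor 4: no market substitution
])
# score = (0.0 + 0.15 - 0.20 + 0.30) / 1.0 = 0.25
```

A weighted score makes the dominance of Factors 1 and 4 explicit, which matches how courts tend to weigh them; the thresholds for low/medium/high risk would still be a policy choice.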
Landmark AI Copyright Cases (as of early 2026)
| Case | Jurisdiction | Status | Key Issue | Implication for Training Data |
|---|---|---|---|---|
| NYT v. OpenAI (2023-) | US | Ongoing (discovery) | Verbatim memorization of articles | If lost: news content may require licensing |
| Authors Guild v. OpenAI (2023-) | US | Ongoing (class action) | Book-length training data | Class action scale could set broad precedent |
| Getty v. Stability AI (2023) | US/UK | UK: partial ruling for Getty | Image training data | Visual data may have different fair use calculus |
| Doe v. GitHub (Copilot) | US | Settled / ongoing | Code reproduction | Open source licenses may require attribution |
| Thomson Reuters v. Ross | US | Ross lost on fair use | Legal document training | Factual works not automatically fair use |
The legal landscape is moving quickly and any analysis here reflects the state as of early 2026. No court has yet issued a definitive ruling on whether training LLMs on copyrighted data is fair use. Engineering teams should build data pipelines that maintain provenance, respect opt-out signals, and can retroactively remove specific sources if required by future rulings.
Opt-Out Mechanisms
Technical Standards for Data Exclusion
```python
import re
from urllib.parse import urlparse
from dataclasses import dataclass


@dataclass
class OptOutSignal:
    """An opt-out signal detected for a URL or document."""
    source_url: str
    signal_type: str
    signal_value: str
    is_blocking: bool
    detected_at: str


class OptOutDetector:
    """
    Detect opt-out signals from content providers.

    Opt-out mechanisms:
    1. robots.txt: GPTBot, CCBot, and other AI crawler user-agents
    2. HTTP headers: X-Robots-Tag: noai, noimageai
    3. HTML meta tags: robots noai directives
    4. C2PA metadata: embedded content credentials
    5. TDM reservation: EU Text and Data Mining opt-out
       (Article 4, DSM Directive)
    6. ai.txt: proposed standard for AI-specific permissions
    """

    AI_USER_AGENTS = [
        "GPTBot",
        "ChatGPT-User",
        "CCBot",
        "Google-Extended",
        "Anthropic-AI",
        "ClaudeBot",
        "Bytespider",
        "PerplexityBot",
        "Cohere-AI",
    ]

    def check_robots_txt(self, robots_txt_content, url):
        """
        Parse robots.txt for AI crawler blocks.
        Returns a list of opt-out signals found.
        """
        # robots.txt user-agent tokens are case-insensitive.
        ai_agents = {a.lower() for a in self.AI_USER_AGENTS}
        signals = []
        current_agent = None

        for line in robots_txt_content.split("\n"):
            line = line.strip()
            if line.lower().startswith("user-agent:"):
                current_agent = line.split(":", 1)[1].strip()
            elif line.lower().startswith("disallow:"):
                path = line.split(":", 1)[1].strip()
                if current_agent and current_agent.lower() in ai_agents:
                    signals.append(
                        OptOutSignal(
                            source_url=url,
                            signal_type="robots_txt",
                            signal_value=(
                                f"User-agent: {current_agent}, "
                                f"Disallow: {path}"
                            ),
                            is_blocking=path == "/",
                            detected_at="robots_txt",
                        )
                    )
        return signals

    def check_http_headers(self, headers, url):
        """
        Check HTTP response headers for AI opt-out signals:
        X-Robots-Tag: noai
        X-Robots-Tag: noimageai
        """
        signals = []
        x_robots = headers.get("X-Robots-Tag", "")
        if "noai" in x_robots.lower():
            signals.append(
                OptOutSignal(
                    source_url=url,
                    signal_type="http_header",
                    signal_value=f"X-Robots-Tag: {x_robots}",
                    is_blocking=True,
                    detected_at="http_header",
                )
            )
        return signals

    def check_html_meta(self, html_content, url):
        """
        Check HTML meta tags for AI opt-out signals.
        Patterns detected in the wild.
        """
        signals = []
        meta_patterns = [
            r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']*noai[^"\']*)["\']',
            r'<meta\s+name=["\']ai-training["\']\s+content=["\']([^"\']*)["\']',
        ]
        for pattern in meta_patterns:
            matches = re.findall(pattern, html_content, re.IGNORECASE)
            for match in matches:
                signals.append(
                    OptOutSignal(
                        source_url=url,
                        signal_type="html_meta",
                        signal_value=match,
                        is_blocking="disallow" in match.lower()
                        or "noai" in match.lower(),
                        detected_at="html_meta",
                    )
                )
        return signals

    def check_tdm_reservation(self, html_content, url):
        """
        Check for an EU Text and Data Mining reservation.

        Article 4 of the DSM Directive allows rights holders
        to reserve their rights against TDM. This is
        expressed via machine-readable metadata.
        """
        signals = []
        tdm_patterns = [
            r'<meta\s+name=["\']tdm-reservation["\']\s+content=["\']([^"\']*)["\']',
            r'"tdm:reservation"\s*:\s*"(true|1)"',
        ]
        for pattern in tdm_patterns:
            matches = re.findall(pattern, html_content, re.IGNORECASE)
            for match in matches:
                signals.append(
                    OptOutSignal(
                        source_url=url,
                        signal_type="tdm_reservation",
                        signal_value=match,
                        is_blocking=True,
                        detected_at="html_meta",
                    )
                )
        return signals

    def check_all(self, url, robots_txt=None,
                  headers=None, html_content=None):
        """Run all opt-out checks for a URL."""
        all_signals = []
        if robots_txt:
            all_signals.extend(self.check_robots_txt(robots_txt, url))
        if headers:
            all_signals.extend(self.check_http_headers(headers, url))
        if html_content:
            all_signals.extend(self.check_html_meta(html_content, url))
            all_signals.extend(
                self.check_tdm_reservation(html_content, url)
            )

        is_blocked = any(s.is_blocking for s in all_signals)
        return {
            "url": url,
            "signals": all_signals,
            "is_blocked": is_blocked,
            "n_signals": len(all_signals),
        }
```
Opt-Out Adoption by Top 1000 Websites (early 2026)
| Opt-Out Mechanism | Adoption Rate | Coverage | Enforcement | Standard Body |
|---|---|---|---|---|
| robots.txt (GPTBot block) | 35% | OpenAI crawlers | Voluntary (OpenAI honors) | De facto standard |
| robots.txt (Google-Extended) | 28% | Google AI crawlers | Voluntary (Google honors) | De facto standard |
| robots.txt (CCBot block) | 20% | Common Crawl | Voluntary | De facto standard |
| TDM reservation (EU) | 8% | EU jurisdiction only | Legal (DSM Directive) | EU Directive 2019/790 |
| C2PA metadata | 3% | Image/video provenance | Technical (not legal) | C2PA Coalition |
| ai.txt | 1% | All AI crawlers | Proposed, not adopted | Spawning.ai proposal |
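For the robots.txt mechanism in particular, you do not have to hand-roll the parsing: Python's standard library ships `urllib.robotparser`, which handles user-agent groups and path matching for you. A small sketch (the robots.txt content and bot names here are illustrative):

```python
# Check AI-crawler permissions with the stdlib robots.txt parser.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# GPTBot is blocked site-wide; a generic crawler is only
# blocked from /private/.
gptbot_ok = parser.can_fetch("GPTBot", "https://example.com/articles/1")
generic_ok = parser.can_fetch("SomeBot", "https://example.com/articles/1")
# gptbot_ok -> False, generic_ok -> True
```

In production you would fetch `https://<domain>/robots.txt`, run `can_fetch` once per AI user-agent you care about, and record the result as a provenance field, since sites change their robots.txt over time.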
Regulatory Landscape
Jurisdiction-by-Jurisdiction Analysis
```python
@dataclass
class RegulatoryRequirement:
    """A specific regulatory requirement for training data."""
    jurisdiction: Jurisdiction
    regulation: str
    requirement: str
    applies_to: str
    penalty: str
    effective_date: str
    compliance_action: str


class RegulatoryComplianceChecker:
    """
    Check training data pipelines against regulatory
    requirements across jurisdictions.
    """

    REQUIREMENTS = [
        RegulatoryRequirement(
            jurisdiction=Jurisdiction.EU,
            regulation="EU AI Act (Article 53)",
            requirement=(
                "Providers of general-purpose AI models must "
                "document and make publicly available a "
                "sufficiently detailed summary of the content "
                "used for training."
            ),
            applies_to="GPAI model providers",
            penalty="Up to 3% of global annual turnover",
            effective_date="2025-08-02",
            compliance_action=(
                "Maintain detailed training data provenance. "
                "Publish summary using template from AI Office."
            ),
        ),
        RegulatoryRequirement(
            jurisdiction=Jurisdiction.EU,
            regulation="DSM Directive (Article 4)",
            requirement=(
                "Text and data mining is permitted except "
                "where rights holders have expressly reserved "
                "their rights in a machine-readable manner."
            ),
            applies_to="Anyone performing TDM in EU",
            penalty="Copyright infringement liability",
            effective_date="2021-06-07",
            compliance_action=(
                "Check for TDM reservations before scraping. "
                "Honor opt-out signals."
            ),
        ),
        RegulatoryRequirement(
            jurisdiction=Jurisdiction.EU,
            regulation="GDPR (Articles 6, 17, 22)",
            requirement=(
                "Processing personal data requires a legal "
                "basis. Data subjects have the right to "
                "erasure and to not be subject to solely "
                "automated decision-making."
            ),
            applies_to="Any entity processing EU personal data",
            penalty="Up to 4% of global annual turnover or 20M EUR",
            effective_date="2018-05-25",
            compliance_action=(
                "PII scrubbing pipeline. Right-to-erasure "
                "mechanism. DPIA for training data processing."
            ),
        ),
        RegulatoryRequirement(
            jurisdiction=Jurisdiction.US,
            regulation="Copyright Act (Section 107 Fair Use)",
            requirement=(
                "Fair use is determined by four factors. "
                "No bright-line rule for AI training. "
                "Pending litigation will clarify."
            ),
            applies_to="Anyone using copyrighted content in US",
            penalty="Statutory damages up to $150K per work",
            effective_date="1976-01-01",
            compliance_action=(
                "Maintain provenance records. Build ability "
                "to remove specific sources retroactively. "
                "Monitor litigation outcomes."
            ),
        ),
        RegulatoryRequirement(
            jurisdiction=Jurisdiction.JAPAN,
            regulation="Copyright Act (Article 30-4)",
            requirement=(
                "Use of copyrighted works for information "
                "analysis (including AI training) is permitted "
                "regardless of the rights holder's intent, "
                "as long as it does not unreasonably prejudice "
                "the rights holder's interests."
            ),
            applies_to="Anyone performing information analysis in Japan",
            penalty="Standard copyright penalties if exception does not apply",
            effective_date="2019-01-01",
            compliance_action=(
                "Document that use is for information analysis. "
                "Ensure outputs do not reproduce originals."
            ),
        ),
    ]

    def check_compliance(self, data_source, jurisdictions):
        """
        Check whether a data source complies with
        regulations in the specified jurisdictions.
        """
        issues = []
        for req in self.REQUIREMENTS:
            if req.jurisdiction not in jurisdictions:
                continue
            issue = self._check_requirement(data_source, req)
            if issue:
                issues.append(issue)
        return {
            "source": data_source.get("name", ""),
            "jurisdictions": [j.value for j in jurisdictions],
            "issues": issues,
            "compliant": len(issues) == 0,
        }

    def _check_requirement(self, data_source, requirement):
        """Check a single requirement (simplified checks)."""
        if (
            requirement.regulation == "GDPR (Articles 6, 17, 22)"
            and data_source.get("contains_pii", False)
            and not data_source.get("pii_scrubbed", False)
        ):
            return {
                "regulation": requirement.regulation,
                "issue": "Contains PII without scrubbing",
                "action": requirement.compliance_action,
                "severity": "high",
            }
        if (
            "DSM Directive" in requirement.regulation
            and data_source.get("tdm_reserved", False)
        ):
            return {
                "regulation": requirement.regulation,
                "issue": "Source has TDM reservation",
                "action": "Exclude from training data",
                "severity": "high",
            }
        return None
```
Copyright Compliance Pipeline
Production Implementation
```python
from datetime import datetime

# Builds on OptOutDetector, FairUseAnalyzer, and
# RegulatoryComplianceChecker from the earlier sections;
# `re` and `urlparse` are imported there.


class CopyrightCompliancePipeline:
    """
    End-to-end pipeline for copyright-compliant data collection.

    Stages:
    1. Pre-crawl: check robots.txt, ai.txt
    2. Crawl: check HTTP headers, HTML meta tags
    3. Post-crawl: classify content, check licenses
    4. Provenance: record full chain of custody
    5. Audit: generate compliance reports
    """

    def __init__(self, config):
        self.config = config
        self.opt_out_detector = OptOutDetector()
        self.fair_use_analyzer = FairUseAnalyzer()
        self.compliance_checker = RegulatoryComplianceChecker()
        self.provenance_store = {}
        self.blocked_domains = set()

    def process_url(self, url, robots_txt=None,
                    headers=None, html_content=None):
        """Process a single URL through the compliance pipeline."""
        # Stage 1: Opt-out check
        opt_out_result = self.opt_out_detector.check_all(
            url, robots_txt, headers, html_content
        )
        if opt_out_result["is_blocked"]:
            self._record_provenance(url, "blocked", opt_out_result)
            return {
                "url": url,
                "decision": "BLOCKED",
                "reason": "opt_out_signal_detected",
                "signals": opt_out_result["signals"],
            }

        # Stage 2: License detection
        license_info = self._detect_license(html_content, url)

        # Stage 3: Content classification
        content_type = self._classify_content(html_content)

        # Stage 4: Fair use analysis
        fair_use = self.fair_use_analyzer.analyze_training_use(
            source_type=content_type,
            use_context={
                "commercial": True,
                "transformative": True,
                "amount": "full",
                "market_substitute": (
                    content_type in ("news", "fiction")
                ),
            },
        )

        # Stage 5: Record provenance
        provenance = {
            "url": url,
            "crawl_date": datetime.now().isoformat(),
            "opt_out_signals": [
                s.__dict__ for s in opt_out_result["signals"]
            ],
            "license": license_info,
            "content_type": content_type,
            "fair_use_risk": fair_use.risk_level,
            "decision": (
                "INCLUDE" if fair_use.risk_level != "high"
                else "REVIEW"
            ),
        }
        self._record_provenance(url, "processed", provenance)
        return provenance

    def _detect_license(self, html_content, url):
        """Detect Creative Commons or other licenses."""
        if html_content is None:
            return None
        cc_patterns = {
            "CC-BY": r"creativecommons\.org/licenses/by/",
            "CC-BY-SA": r"creativecommons\.org/licenses/by-sa/",
            "CC-BY-NC": r"creativecommons\.org/licenses/by-nc/",
            "CC-BY-ND": r"creativecommons\.org/licenses/by-nd/",
            "CC0": r"creativecommons\.org/publicdomain/zero/",
        }
        for license_name, pattern in cc_patterns.items():
            if re.search(pattern, html_content):
                return {
                    "type": license_name,
                    "commercial_ok": "NC" not in license_name,
                    "derivatives_ok": "ND" not in license_name,
                }
        return None

    def _classify_content(self, html_content):
        """Classify content type for fair use analysis."""
        if html_content is None:
            return "unknown"
        # Simplified; a production system would distinguish
        # news, fiction, code, reference content, etc.
        return "web_content"

    def _record_provenance(self, url, status, data):
        """Record provenance for the audit trail."""
        self.provenance_store[url] = {
            "status": status,
            "data": data,
            "recorded_at": datetime.now().isoformat(),
        }

    def generate_audit_report(self):
        """
        Generate a compliance audit report.
        Required by EU AI Act Article 53 for GPAI models.
        """
        total = len(self.provenance_store)
        blocked = sum(
            1 for v in self.provenance_store.values()
            if v["status"] == "blocked"
        )
        included = sum(
            1 for v in self.provenance_store.values()
            if v["status"] == "processed"
            and v["data"].get("decision") == "INCLUDE"
        )
        review = sum(
            1 for v in self.provenance_store.values()
            if v["data"].get("decision") == "REVIEW"
        )
        return {
            "report_date": datetime.now().isoformat(),
            "total_urls_processed": total,
            "blocked_by_opt_out": blocked,
            "included": included,
            "pending_review": review,
            "blocked_rate": blocked / total if total else 0,
            "provenance_records": total,
        }

    def handle_takedown_request(self, urls):
        """
        Handle a copyright takedown request.

        Must be able to remove specific URLs from training data
        and retrain or fine-tune to reduce memorization of
        removed content.
        """
        removed = []
        for url in urls:
            if url in self.provenance_store:
                self.provenance_store[url]["status"] = "removed_takedown"
                self.blocked_domains.add(urlparse(url).netloc)
                removed.append(url)
        return {
            "requested": len(urls),
            "removed": len(removed),
            "domains_blocked": len(self.blocked_domains),
            "requires_retrain": len(removed) > 0,
        }
```
Training Data Sources by Copyright Risk Level
*(Table data not recoverable from the source. It compared source categories: Public Domain, CC Licensed, Factual/News, Creative/Fiction, Code (OSS), User Generated, and Unknown, on metrics including their proportion of a typical training corpus.)*
The ability to handle takedown requests retroactively is a hard engineering requirement. If a court orders removal of specific copyrighted content from training data, you must be able to identify which training examples came from that content, remove them, and ideally fine-tune the model to reduce memorization of the removed content. This requires maintaining provenance records from crawl through tokenization through training.
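The crawl-to-training provenance chain described above can be sketched with a content-hash keyed index: every document gets a stable ID at crawl time, the ID travels through dedup/filter/tokenization, and a reverse index maps it back to the source URL. All names and URLs here are illustrative, not a production schema:

```python
# Minimal provenance chain: content hash -> source URL, with takedown.
import hashlib

def doc_id(text: str) -> str:
    """Stable content hash used as the document's provenance key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

provenance = {}   # doc_id -> {"url": ..., "crawl_date": ...}
corpus = {}       # doc_id -> text retained for training

def ingest(url: str, text: str, crawl_date: str) -> str:
    did = doc_id(text)
    provenance[did] = {"url": url, "crawl_date": crawl_date}
    corpus[did] = text
    return did

def takedown(url: str) -> list:
    """Remove every document that came from `url`; return their IDs."""
    hit = [d for d, p in provenance.items() if p["url"] == url]
    for d in hit:
        corpus.pop(d, None)                      # drop from training set
        provenance[d]["status"] = "removed_takedown"  # keep audit record
    return hit

ingest("https://example.com/a", "article one", "2026-01-01")
ingest("https://example.com/b", "article two", "2026-01-01")
removed = takedown("https://example.com/a")  # corpus now holds only /b
```

The key design choice is keeping the provenance record after removal: the audit trail must show that the content existed, where it came from, and when it was removed, even though the text itself is gone from the corpus.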
Practical Compliance Checklist
Engineering Requirements
```python
class ComplianceChecklist:
    """
    Minimum engineering requirements for copyright-compliant
    training data pipelines.
    """

    CHECKLIST = [
        {
            "category": "Opt-Out Compliance",
            "items": [
                "Check robots.txt for all AI user-agents before crawling",
                "Check HTTP X-Robots-Tag headers during crawl",
                "Check HTML meta tags for noai directives",
                "Check TDM reservation metadata (EU requirement)",
                "Re-check opt-out signals monthly (sites update robots.txt)",
                "Maintain a blocklist of opted-out domains",
            ],
        },
        {
            "category": "Provenance Tracking",
            "items": [
                "Record source URL for every training document",
                "Record crawl date and time",
                "Record detected license (CC, public domain, etc.)",
                "Record opt-out signal status at crawl time",
                "Map documents through dedup/filter/tokenization pipeline",
                "Enable reverse lookup: training token -> source URL",
            ],
        },
        {
            "category": "Takedown Mechanism",
            "items": [
                "Accept and process takedown requests within 72 hours",
                "Remove specified URLs from training data storage",
                "Add domains to blocklist for future crawls",
                "Document removal for compliance audit",
                "Evaluate whether model retraining is needed",
            ],
        },
        {
            "category": "EU AI Act Compliance",
            "items": [
                "Prepare training data summary (Article 53 template)",
                "Document copyright policy and fair use analysis",
                "Implement TDM reservation checking",
                "GDPR: PII scrubbing and right-to-erasure mechanism",
                "Appoint an EU authorized representative if outside EU",
            ],
        },
        {
            "category": "PII and Privacy",
            "items": [
                "Multi-stage PII scrubbing (regex + NER)",
                "GDPR right-to-erasure mechanism",
                "Data minimization: do not retain unnecessary personal data",
                "Data protection impact assessment (DPIA)",
                "Cross-border transfer safeguards (SCCs or adequacy)",
            ],
        },
    ]

    def audit(self, pipeline_state):
        """
        Run a compliance audit against the checklist.
        Returns pass/fail for each item.
        """
        results = []
        for category in self.CHECKLIST:
            category_results = {
                "category": category["category"],
                "items": [],
                "pass_rate": 0.0,
            }
            passed = 0
            for item in category["items"]:
                is_implemented = pipeline_state.get(item, False)
                category_results["items"].append({
                    "item": item,
                    "status": "PASS" if is_implemented else "FAIL",
                })
                if is_implemented:
                    passed += 1
            category_results["pass_rate"] = (
                passed / len(category["items"])
            )
            results.append(category_results)
        return results
```
Key Takeaways
Copyright law and AI training exist in a state of legal uncertainty. No jurisdiction has fully resolved whether training on copyrighted data is permissible. Engineering teams must build for compliance under multiple possible future outcomes.
The critical decisions:
- **Opt-out signals must be checked and honored**: 35% of top websites block GPTBot via robots.txt. Ignoring opt-out signals creates legal risk and reputational damage. Check robots.txt, HTTP headers, HTML meta tags, and TDM reservations before including any content.
- **Provenance records are legally required in the EU**: The EU AI Act (Article 53) requires GPAI providers to publish a training data summary. Even outside the EU, provenance enables takedown compliance and supports a fair use defense (you can demonstrate what was and was not in training data).
- **Takedown capability is a hard requirement**: Courts may order removal of specific copyrighted content. Without the ability to trace training tokens back to source URLs and remove them, compliance is impossible. Build this capability from day one.
- **Fair use is not a blanket defense**: Each source type has a different fair use profile. Factual content (news, Wikipedia, papers) has a stronger fair use argument than creative content (fiction, poetry, music). But even factual content can fail on Factor 4 (market substitution) if the model's output directly competes with the original.
- **Japan is the most permissive jurisdiction; the EU is the most restrictive**: Japan's Article 30-4 explicitly permits AI training for information analysis. The EU requires opt-out compliance (DSM Directive), training data disclosure (AI Act), and PII handling (GDPR). US law is uncertain pending litigation. Build for the most restrictive jurisdiction your model will serve.