Part of the series The Dataset Frontier (26 of 27)

The New York Times is suing OpenAI for training GPT-4 on NYT articles. The Authors Guild is suing over the use of copyrighted books. Getty Images sued Stability AI for training Stable Diffusion on watermarked photos. These cases will determine whether scraping copyrighted text for LLM training is fair use or a billion-dollar liability. As of March 2026, no definitive ruling exists: the law has not caught up to the technology, and every frontier lab operates in a legal gray area.

This post covers the legal landscape as it affects engineering decisions: what the law says, how courts have ruled, what compliance requires technically, and how to build data pipelines that respect opt-out signals and maintain provenance records.

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Jurisdiction(Enum):
    US = "united_states"
    EU = "european_union"
    UK = "united_kingdom"
    JAPAN = "japan"
    CHINA = "china"
    CANADA = "canada"

class CopyrightStatus(Enum):
    PUBLIC_DOMAIN = "public_domain"
    CREATIVE_COMMONS = "creative_commons"
    COPYRIGHTED_FAIR_USE_LIKELY = "copyrighted_fair_use_likely"
    COPYRIGHTED_FAIR_USE_UNCERTAIN = "copyrighted_fair_use_uncertain"
    COPYRIGHTED_RESTRICTED = "copyrighted_restricted"
    OPT_OUT = "opt_out"
    UNKNOWN = "unknown"

@dataclass
class FairUseFactor:
    """One of the four fair use factors (US law)."""
    factor_number: int
    name: str
    analysis: str
    favors: str  # "plaintiff" or "defendant" or "neutral"
    weight: float

@dataclass
class CopyrightAnalysis:
    """Copyright analysis for a data source."""
    source: str
    jurisdiction: Jurisdiction
    status: CopyrightStatus
    fair_use_factors: list = field(default_factory=list)
    opt_out_signal: Optional[str] = None
    license: Optional[str] = None
    risk_level: str = "unknown"
    notes: str = ""

class FairUseAnalyzer:
    """
    Analyze fair use factors for training data.

    US copyright law (17 U.S.C. 107) defines four factors:
    1. Purpose and character of the use
       (commercial vs educational, transformative vs copying)
    2. Nature of the copyrighted work
       (factual vs creative, published vs unpublished)
    3. Amount used relative to the whole
       (small excerpt vs entire work)
    4. Effect on the market for the original
       (does the AI output substitute for the original?)

    No single factor is determinative. Courts weigh
    all four together.
    """

    def analyze_training_use(self, source_type,
                              use_context):
        """
        Analyze fair use for a specific training data source.
        """
        factors = []

        # Factor 1: Purpose and character
        f1 = self._analyze_factor_1(use_context)
        factors.append(f1)

        # Factor 2: Nature of the work
        f2 = self._analyze_factor_2(source_type)
        factors.append(f2)

        # Factor 3: Amount used
        f3 = self._analyze_factor_3(
            source_type, use_context
        )
        factors.append(f3)

        # Factor 4: Market effect
        f4 = self._analyze_factor_4(
            source_type, use_context
        )
        factors.append(f4)

        # Overall assessment
        defendant_factors = sum(
            1 for f in factors if f.favors == "defendant"
        )
        plaintiff_factors = sum(
            1 for f in factors if f.favors == "plaintiff"
        )

        if defendant_factors >= 3:
            risk = "low"
        elif defendant_factors >= 2:
            risk = "medium"
        else:
            risk = "high"

        return CopyrightAnalysis(
            source=source_type,
            jurisdiction=Jurisdiction.US,
            status=(
                CopyrightStatus.COPYRIGHTED_FAIR_USE_LIKELY
                if risk == "low"
                else CopyrightStatus.COPYRIGHTED_FAIR_USE_UNCERTAIN
            ),
            fair_use_factors=factors,
            risk_level=risk,
        )

    def _analyze_factor_1(self, use_context):
        """
        Factor 1: Purpose and character of the use.

        Key question: is the use transformative?
        Training a model that generates new text from
        learned patterns is likely transformative.
        The output is not a copy of the input.

        However: commercial use weighs against fair use.
        """
        is_commercial = use_context.get(
            "commercial", True
        )
        is_transformative = use_context.get(
            "transformative", True
        )

        if is_transformative and not is_commercial:
            favors = "defendant"
            analysis = (
                "Transformative non-commercial use. "
                "Model learns patterns, does not reproduce "
                "originals. Strongly favors fair use."
            )
        elif is_transformative and is_commercial:
            favors = "neutral"
            analysis = (
                "Transformative but commercial. The "
                "transformative nature of training weighs "
                "for defendant, but commercial purpose "
                "weighs for plaintiff. Net: neutral."
            )
        else:
            favors = "plaintiff"
            analysis = (
                "Non-transformative commercial use. "
                "Direct copying for commercial gain."
            )

        return FairUseFactor(
            factor_number=1,
            name="Purpose and character of the use",
            analysis=analysis,
            favors=favors,
            weight=0.35,
        )

    def _analyze_factor_2(self, source_type):
        """
        Factor 2: Nature of the copyrighted work.

        Factual works get less protection than creative works.
        Published works get less protection than unpublished.
        """
        creative_sources = {
            "fiction", "poetry", "music_lyrics",
            "screenplays", "novels",
        }
        factual_sources = {
            "news", "wikipedia", "scientific_papers",
            "government_reports", "legal_documents",
        }

        if source_type in factual_sources:
            return FairUseFactor(
                factor_number=2,
                name="Nature of the copyrighted work",
                analysis=(
                    f"Source type '{source_type}' is primarily "
                    f"factual. Factual works receive thinner "
                    f"copyright protection."
                ),
                favors="defendant",
                weight=0.15,
            )
        elif source_type in creative_sources:
            return FairUseFactor(
                factor_number=2,
                name="Nature of the copyrighted work",
                analysis=(
                    f"Source type '{source_type}' is creative. "
                    f"Creative works receive strong copyright "
                    f"protection."
                ),
                favors="plaintiff",
                weight=0.15,
            )
        else:
            return FairUseFactor(
                factor_number=2,
                name="Nature of the copyrighted work",
                analysis="Mixed factual/creative content.",
                favors="neutral",
                weight=0.15,
            )

    def _analyze_factor_3(self, source_type, use_context):
        """
        Factor 3: Amount and substantiality used.

        Training on full documents weighs against fair use.
        Training on snippets or abstracts weighs for it.
        """
        amount = use_context.get("amount", "full")

        if amount == "snippet":
            return FairUseFactor(
                factor_number=3,
                name="Amount and substantiality",
                analysis="Only snippets or abstracts used.",
                favors="defendant",
                weight=0.20,
            )
        elif amount == "full":
            return FairUseFactor(
                factor_number=3,
                name="Amount and substantiality",
                analysis=(
                    "Entire works used for training. "
                    "Weighs against fair use."
                ),
                favors="plaintiff",
                weight=0.20,
            )
        else:
            return FairUseFactor(
                factor_number=3,
                name="Amount and substantiality",
                analysis="Partial use.",
                favors="neutral",
                weight=0.20,
            )

    def _analyze_factor_4(self, source_type, use_context):
        """
        Factor 4: Effect on the market.

        Does the AI model substitute for the original?
        A chatbot that summarizes NYT articles reduces
        demand for NYT subscriptions (market harm).
        A code assistant trained on GitHub does not
        substitute for the original repositories.
        """
        market_substitute = use_context.get(
            "market_substitute", False
        )

        if market_substitute:
            return FairUseFactor(
                factor_number=4,
                name="Effect on the market",
                analysis=(
                    "Model output can substitute for original "
                    "works, reducing demand for the original. "
                    "Strongest factor against fair use."
                ),
                favors="plaintiff",
                weight=0.30,
            )
        else:
            return FairUseFactor(
                factor_number=4,
                name="Effect on the market",
                analysis=(
                    "Model output does not directly substitute "
                    "for the original works. Different market."
                ),
                favors="defendant",
                weight=0.30,
            )
Landmark AI Copyright Cases (as of early 2026)

| Case | Jurisdiction | Status | Key Issue | Implication for Training Data |
| --- | --- | --- | --- | --- |
| NYT v. OpenAI (2023-) | US | Ongoing (discovery) | Verbatim memorization of articles | If lost: news content may require licensing |
| Authors Guild v. OpenAI (2023-) | US | Ongoing (class action) | Book-length training data | Class-action scale could set broad precedent |
| Getty v. Stability AI (2023) | US/UK | UK: partial ruling for Getty | Image training data | Visual data may have a different fair use calculus |
| Doe v. GitHub (Copilot) | US | Settled / ongoing | Code reproduction | Open-source licenses may require attribution |
| Thomson Reuters v. Ross | US | Ross lost on fair use | Legal document training | Factual works are not automatically fair use |
🚨 Danger

The legal landscape is moving quickly and any analysis here reflects the state as of early 2026. No court has yet issued a definitive ruling on whether training LLMs on copyrighted data is fair use. Engineering teams should build data pipelines that maintain provenance, respect opt-out signals, and can retroactively remove specific sources if required by future rulings.

Opt-Out Mechanisms

Technical Standards for Data Exclusion

import re
from urllib.parse import urlparse
from dataclasses import dataclass

@dataclass
class OptOutSignal:
    """An opt-out signal detected for a URL or document."""
    source_url: str
    signal_type: str
    signal_value: str
    is_blocking: bool
    detected_at: str

class OptOutDetector:
    """
    Detect opt-out signals from content providers.

    Opt-out mechanisms:
    1. robots.txt: GPTBot, CCBot, and other AI crawler
       user-agents
    2. HTTP headers: X-Robots-Tag: noai, noimageai
    3. HTML meta tags: robots noai directives
    4. C2PA metadata: embedded content credentials
    5. TDM reservation: EU Text and Data Mining opt-out
       (Article 4 DSM Directive)
    6. ai.txt: proposed standard for AI-specific permissions
    """

    AI_USER_AGENTS = [
        "GPTBot",
        "ChatGPT-User",
        "CCBot",
        "Google-Extended",
        "Anthropic-AI",
        "ClaudeBot",
        "Bytespider",
        "PerplexityBot",
        "Cohere-AI",
    ]

    def check_robots_txt(self, robots_txt_content, url):
        """
        Parse robots.txt for AI crawler blocks.

        Handles groups with multiple User-agent lines and
        matches agent names case-insensitively, as robots.txt
        convention requires. Returns opt-out signals found.
        """
        signals = []
        ai_agents_lower = {a.lower() for a in self.AI_USER_AGENTS}

        current_agents = []   # user-agents in the current group
        in_group_rules = False
        for line in robots_txt_content.split("\n"):
            line = line.split("#", 1)[0].strip()
            if not line:
                continue

            field_name, _, value = line.partition(":")
            field_name = field_name.strip().lower()
            value = value.strip()

            if field_name == "user-agent":
                # A User-agent line after rules starts a new group
                if in_group_rules:
                    current_agents = []
                    in_group_rules = False
                current_agents.append(value)

            elif field_name == "disallow":
                in_group_rules = True
                for agent in current_agents:
                    if agent.lower() in ai_agents_lower:
                        signals.append(
                            OptOutSignal(
                                source_url=url,
                                signal_type="robots_txt",
                                signal_value=(
                                    f"User-agent: {agent}, "
                                    f"Disallow: {value}"
                                ),
                                is_blocking=value == "/",
                                detected_at="robots_txt",
                            )
                        )

        return signals

    def check_http_headers(self, headers, url):
        """
        Check HTTP response headers for AI opt-out signals.

        X-Robots-Tag: noai
        X-Robots-Tag: noimageai
        """
        signals = []

        x_robots = headers.get("X-Robots-Tag", "")
        if "noai" in x_robots.lower():
            signals.append(
                OptOutSignal(
                    source_url=url,
                    signal_type="http_header",
                    signal_value=f"X-Robots-Tag: {x_robots}",
                    is_blocking=True,
                    detected_at="http_header",
                )
            )

        return signals

    def check_html_meta(self, html_content, url):
        """
        Check HTML meta tags for AI opt-out signals.

        Patterns detected in the wild.
        """
        signals = []

        meta_patterns = [
            r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']*noai[^"\']*)["\']',
            r'<meta\s+name=["\']ai-training["\']\s+content=["\']([^"\']*)["\']',
        ]

        for pattern in meta_patterns:
            matches = re.findall(
                pattern, html_content, re.IGNORECASE
            )
            for match in matches:
                signals.append(
                    OptOutSignal(
                        source_url=url,
                        signal_type="html_meta",
                        signal_value=match,
                        is_blocking="disallow" in match.lower()
                        or "noai" in match.lower(),
                        detected_at="html_meta",
                    )
                )

        return signals

    def check_tdm_reservation(self, html_content, url):
        """
        Check for EU Text and Data Mining reservation.

        Article 4 of the DSM Directive allows rights holders
        to reserve their rights against TDM. This is
        expressed via machine-readable metadata.
        """
        signals = []

        tdm_patterns = [
            r'<meta\s+name=["\']tdm-reservation["\']\s+content=["\']([^"\']*)["\']',
            r'"tdm:reservation"\s*:\s*"(true|1)"',
        ]

        for pattern in tdm_patterns:
            matches = re.findall(
                pattern, html_content, re.IGNORECASE
            )
            for match in matches:
                signals.append(
                    OptOutSignal(
                        source_url=url,
                        signal_type="tdm_reservation",
                        signal_value=match,
                        is_blocking=True,
                        detected_at="html_meta",
                    )
                )

        return signals

    def check_all(self, url, robots_txt=None,
                   headers=None, html_content=None):
        """Run all opt-out checks for a URL."""
        all_signals = []

        if robots_txt:
            all_signals.extend(
                self.check_robots_txt(robots_txt, url)
            )
        if headers:
            all_signals.extend(
                self.check_http_headers(headers, url)
            )
        if html_content:
            all_signals.extend(
                self.check_html_meta(html_content, url)
            )
            all_signals.extend(
                self.check_tdm_reservation(
                    html_content, url
                )
            )

        is_blocked = any(
            s.is_blocking for s in all_signals
        )

        return {
            "url": url,
            "signals": all_signals,
            "is_blocked": is_blocked,
            "n_signals": len(all_signals),
        }
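The robots.txt files that publishers actually deploy look like the example below. This standalone check (separate from the `OptOutDetector` class, and deliberately simplified: it assumes one User-agent line per group) flags which AI agents are fully blocked:

```python
# A robots.txt of the kind publishers deploy to opt out of AI training crawls.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

AI_AGENTS = {"gptbot", "google-extended", "ccbot", "claudebot"}

def blocked_ai_agents(robots_txt):
    """Return AI user-agents fully blocked (Disallow: /).
    Simplified: assumes one User-agent line per group."""
    blocked, current = set(), None
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            current = value.lower()
        elif key == "disallow" and value == "/" and current in AI_AGENTS:
            blocked.add(current)
    return blocked
```

Here `blocked_ai_agents(ROBOTS_TXT)` returns `{"gptbot", "google-extended"}`: the site blocks OpenAI and Google AI crawlers while remaining open to everything else.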
Opt-Out Adoption by Top 1000 Websites (early 2026)

| Opt-Out Mechanism | Adoption Rate | Coverage | Enforcement | Standard Body |
| --- | --- | --- | --- | --- |
| robots.txt (GPTBot block) | 35% | OpenAI crawlers | Voluntary (OpenAI honors) | De facto standard |
| robots.txt (Google-Extended) | 28% | Google AI crawlers | Voluntary (Google honors) | De facto standard |
| robots.txt (CCBot block) | 20% | Common Crawl | Voluntary | De facto standard |
| TDM reservation (EU) | 8% | EU jurisdiction only | Legal (DSM Directive) | EU Directive 2019/790 |
| C2PA metadata | 3% | Image/video provenance | Technical (not legal) | C2PA Coalition |
| ai.txt | 1% | All AI crawlers | Proposed, not adopted | Spawning.ai proposal |

Regulatory Landscape

Jurisdiction-by-Jurisdiction Analysis

@dataclass
class RegulatoryRequirement:
    """A specific regulatory requirement for training data."""
    jurisdiction: Jurisdiction
    regulation: str
    requirement: str
    applies_to: str
    penalty: str
    effective_date: str
    compliance_action: str

class RegulatoryComplianceChecker:
    """
    Check training data pipelines against regulatory
    requirements across jurisdictions.
    """

    REQUIREMENTS = [
        RegulatoryRequirement(
            jurisdiction=Jurisdiction.EU,
            regulation="EU AI Act (Article 53)",
            requirement=(
                "Providers of general-purpose AI models must "
                "document and make publicly available a "
                "sufficiently detailed summary of the content "
                "used for training."
            ),
            applies_to="GPAI model providers",
            penalty="Up to 3% of global annual turnover",
            effective_date="2025-08-02",
            compliance_action=(
                "Maintain detailed training data provenance. "
                "Publish summary using template from AI Office."
            ),
        ),
        RegulatoryRequirement(
            jurisdiction=Jurisdiction.EU,
            regulation="DSM Directive (Article 4)",
            requirement=(
                "Text and data mining is permitted except "
                "where rights holders have expressly reserved "
                "their rights in a machine-readable manner."
            ),
            applies_to="Anyone performing TDM in EU",
            penalty="Copyright infringement liability",
            effective_date="2021-06-07",
            compliance_action=(
                "Check for TDM reservations before scraping. "
                "Honor opt-out signals."
            ),
        ),
        RegulatoryRequirement(
            jurisdiction=Jurisdiction.EU,
            regulation="GDPR (Articles 6, 17, 22)",
            requirement=(
                "Processing personal data requires a legal "
                "basis. Data subjects have the right to "
                "erasure and to not be subject to solely "
                "automated decision-making."
            ),
            applies_to="Any entity processing EU personal data",
            penalty="Up to 4% of global annual turnover or 20M EUR",
            effective_date="2018-05-25",
            compliance_action=(
                "PII scrubbing pipeline. Right-to-erasure "
                "mechanism. DPIA for training data processing."
            ),
        ),
        RegulatoryRequirement(
            jurisdiction=Jurisdiction.US,
            regulation="Copyright Act (Section 107 Fair Use)",
            requirement=(
                "Fair use is determined by four factors. "
                "No bright-line rule for AI training. "
                "Pending litigation will clarify."
            ),
            applies_to="Anyone using copyrighted content in US",
            penalty="Statutory damages up to $150K per work",
            effective_date="1978-01-01",
            compliance_action=(
                "Maintain provenance records. Build ability "
                "to remove specific sources retroactively. "
                "Monitor litigation outcomes."
            ),
        ),
        RegulatoryRequirement(
            jurisdiction=Jurisdiction.JAPAN,
            regulation="Copyright Act (Article 30-4)",
            requirement=(
                "Use of copyrighted works for information "
                "analysis (including AI training) is permitted "
                "regardless of the rights holder's intent, "
                "as long as it does not unreasonably prejudice "
                "the rights holder's interests."
            ),
            applies_to="Anyone performing information analysis in Japan",
            penalty="Standard copyright penalties if exception does not apply",
            effective_date="2019-01-01",
            compliance_action=(
                "Document that use is for information analysis. "
                "Ensure outputs do not reproduce originals."
            ),
        ),
    ]

    def check_compliance(self, data_source, jurisdictions):
        """
        Check if a data source is compliant with
        regulations in specified jurisdictions.
        """
        issues = []

        for req in self.REQUIREMENTS:
            if req.jurisdiction not in jurisdictions:
                continue

            issue = self._check_requirement(
                data_source, req
            )
            if issue:
                issues.append(issue)

        return {
            "source": data_source.get("name", ""),
            "jurisdictions": [j.value for j in jurisdictions],
            "issues": issues,
            "compliant": len(issues) == 0,
        }

    def _check_requirement(self, data_source, requirement):
        """Check a single requirement."""
        # Simplified compliance checks
        if (
            requirement.regulation == "GDPR (Articles 6, 17, 22)"
            and data_source.get("contains_pii", False)
            and not data_source.get("pii_scrubbed", False)
        ):
            return {
                "regulation": requirement.regulation,
                "issue": "Contains PII without scrubbing",
                "action": requirement.compliance_action,
                "severity": "high",
            }

        if (
            "DSM Directive" in requirement.regulation
            and data_source.get("tdm_reserved", False)
        ):
            return {
                "regulation": requirement.regulation,
                "issue": "Source has TDM reservation",
                "action": "Exclude from training data",
                "severity": "high",
            }

        return None
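The gate the checker applies can be condensed into a standalone sketch. The flag names (`contains_pii`, `pii_scrubbed`, `tdm_reserved`) match the dict keys the checker inspects; `gate_source` itself is a hypothetical helper, not part of the class above:

```python
# Standalone compliance gate over the same source-dict flags.
def gate_source(source, jurisdictions):
    """Return a pass/fail verdict for one data source."""
    issues = []
    if "EU" in jurisdictions:
        if source.get("contains_pii") and not source.get("pii_scrubbed"):
            issues.append("GDPR: PII present without scrubbing")
        if source.get("tdm_reserved"):
            issues.append("DSM Art. 4: TDM reservation -> exclude")
    return {"compliant": not issues, "issues": issues}

result = gate_source(
    {"name": "news-crawl-2025", "contains_pii": True,
     "pii_scrubbed": False, "tdm_reserved": True},
    jurisdictions={"EU", "US"},
)
```

For this source, `result["issues"]` lists both the GDPR and the TDM problem; a source with no flags set passes cleanly.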

Production Implementation

import re
from datetime import datetime
from urllib.parse import urlparse

class CopyrightCompliancePipeline:
    """
    End-to-end pipeline for copyright-compliant
    data collection.

    Stages:
    1. Pre-crawl: check robots.txt, ai.txt
    2. Crawl: check HTTP headers, HTML meta tags
    3. Post-crawl: classify content, check licenses
    4. Provenance: record full chain of custody
    5. Audit: generate compliance reports
    """

    def __init__(self, config=None):
        self.config = config or {}
        self.opt_out_detector = OptOutDetector()
        self.fair_use_analyzer = FairUseAnalyzer()
        self.compliance_checker = RegulatoryComplianceChecker()
        self.provenance_store = {}
        self.blocked_domains = set()

    def process_url(self, url, robots_txt=None,
                     headers=None, html_content=None):
        """
        Process a single URL through the compliance pipeline.
        """
        # Stage 1: Opt-out check
        opt_out_result = self.opt_out_detector.check_all(
            url, robots_txt, headers, html_content
        )

        if opt_out_result["is_blocked"]:
            self._record_provenance(url, "blocked", opt_out_result)
            return {
                "url": url,
                "decision": "BLOCKED",
                "reason": "opt_out_signal_detected",
                "signals": opt_out_result["signals"],
            }

        # Stage 2: License detection
        license_info = self._detect_license(
            html_content, url
        )

        # Stage 3: Content classification
        content_type = self._classify_content(html_content)

        # Stage 4: Fair use analysis
        fair_use = self.fair_use_analyzer.analyze_training_use(
            source_type=content_type,
            use_context={
                "commercial": True,
                "transformative": True,
                "amount": "full",
                "market_substitute": (
                    content_type in ("news", "fiction")
                ),
            },
        )

        # Stage 5: Record provenance
        provenance = {
            "url": url,
            "crawl_date": datetime.now().isoformat(),
            "opt_out_signals": [
                s.__dict__
                for s in opt_out_result["signals"]
            ],
            "license": license_info,
            "content_type": content_type,
            "fair_use_risk": fair_use.risk_level,
            "decision": (
                "INCLUDE" if fair_use.risk_level != "high"
                else "REVIEW"
            ),
        }

        self._record_provenance(url, "processed", provenance)

        return provenance

    def _detect_license(self, html_content, url):
        """Detect Creative Commons or other licenses."""
        if html_content is None:
            return None

        cc_patterns = {
            "CC-BY": r"creativecommons\.org/licenses/by/",
            "CC-BY-SA": r"creativecommons\.org/licenses/by-sa/",
            "CC-BY-NC": r"creativecommons\.org/licenses/by-nc/",
            "CC-BY-ND": r"creativecommons\.org/licenses/by-nd/",
            "CC0": r"creativecommons\.org/publicdomain/zero/",
        }

        for license_name, pattern in cc_patterns.items():
            if re.search(pattern, html_content):
                return {
                    "type": license_name,
                    "commercial_ok": "NC" not in license_name,
                    "derivatives_ok": "ND" not in license_name,
                }

        return None

    def _classify_content(self, html_content):
        """Classify content type for fair use analysis."""
        if html_content is None:
            return "unknown"
        # Simplified classification
        return "web_content"

    def _record_provenance(self, url, status, data):
        """Record provenance for audit trail."""
        self.provenance_store[url] = {
            "status": status,
            "data": data,
            "recorded_at": datetime.now().isoformat(),
        }

    def generate_audit_report(self):
        """
        Generate a compliance audit report.

        Required by EU AI Act Article 53 for GPAI models.
        """
        total = len(self.provenance_store)
        blocked = sum(
            1 for v in self.provenance_store.values()
            if v["status"] == "blocked"
        )
        included = sum(
            1 for v in self.provenance_store.values()
            if v["status"] == "processed"
            and v["data"].get("decision") == "INCLUDE"
        )
        review = sum(
            1 for v in self.provenance_store.values()
            if v["data"].get("decision") == "REVIEW"
        )

        return {
            "report_date": datetime.now().isoformat(),
            "total_urls_processed": total,
            "blocked_by_opt_out": blocked,
            "included": included,
            "pending_review": review,
            "blocked_rate": (
                blocked / total if total else 0
            ),
            "provenance_records": total,
        }

    def handle_takedown_request(self, urls):
        """
        Handle a copyright takedown request.

        Must be able to remove specific URLs from
        training data and retrain or fine-tune to
        reduce memorization of removed content.
        """
        removed = []
        for url in urls:
            if url in self.provenance_store:
                self.provenance_store[url]["status"] = (
                    "removed_takedown"
                )
                self.blocked_domains.add(
                    urlparse(url).netloc
                )
                removed.append(url)

        return {
            "requested": len(urls),
            "removed": len(removed),
            "domains_blocked": len(self.blocked_domains),
            "requires_retrain": len(removed) > 0,
        }
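The `_detect_license` stage works by pattern-matching Creative Commons URLs in page markup. A self-contained version of that check, run against a footer that links a CC license (the HTML snippet is illustrative):

```python
import re

# Page footer linking a CC BY-NC license (illustrative input).
HTML = '<a href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a>'

# More specific patterns first, so "by-nc" is not shadowed by "by".
CC_PATTERNS = {
    "CC-BY-NC": r"creativecommons\.org/licenses/by-nc/",
    "CC-BY-SA": r"creativecommons\.org/licenses/by-sa/",
    "CC-BY": r"creativecommons\.org/licenses/by/",
}

def detect_cc(html):
    """Return license info dict, or None if no CC license link found."""
    for name, pattern in CC_PATTERNS.items():
        if re.search(pattern, html):
            return {
                "type": name,
                "commercial_ok": "NC" not in name,   # NC = NonCommercial
                "derivatives_ok": "ND" not in name,  # ND = NoDerivatives
            }
    return None

info = detect_cc(HTML)
```

An NC license is the interesting case for training pipelines: `info["commercial_ok"]` is False, so a commercial training run should route this document to exclusion or review even though it is "openly licensed".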

Training Data Sources by Copyright Risk Level

| Source Category | Share of Typical Training Corpus |
| --- | --- |
| Public domain | 5% |
| CC licensed | 8% |
| Factual/news | 20% |
| Creative/fiction | 3% |
| Code (OSS) | 15% |
| User generated | 35% |
| Unknown | 14% |
⚠️ Warning

The ability to handle takedown requests retroactively is a hard engineering requirement. If a court orders removal of specific copyrighted content from training data, you must be able to identify which training examples came from that content, remove them, and ideally fine-tune the model to reduce memorization of the removed content. This requires maintaining provenance records from crawl through tokenization through training.
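The warning above implies a concrete data structure: a provenance index, maintained from crawl through tokenization, that supports reverse lookup from a source URL to the training documents derived from it. A minimal sketch (the class name and content-hash ID scheme are illustrative):

```python
import hashlib

class ProvenanceIndex:
    """Forward (doc -> URL) and reverse (URL -> docs) provenance maps."""

    def __init__(self):
        self.doc_to_url = {}
        self.url_to_docs = {}

    def register(self, url, text):
        # Content-hash doc IDs survive dedup: identical text maps to one ID.
        doc_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        self.doc_to_url[doc_id] = url
        self.url_to_docs.setdefault(url, set()).add(doc_id)
        return doc_id

    def takedown(self, url):
        """Doc IDs that must be purged from training shards for this URL."""
        return sorted(self.url_to_docs.get(url, set()))

idx = ProvenanceIndex()
doc = idx.register("https://example.com/article", "article body text")
```

In a real pipeline the same doc IDs would be carried into shard metadata, so `takedown(url)` translates directly into which shards to rewrite before any retraining decision.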

Practical Compliance Checklist

Engineering Requirements

class ComplianceChecklist:
    """
    Minimum engineering requirements for copyright-compliant
    training data pipelines.
    """

    CHECKLIST = [
        {
            "category": "Opt-Out Compliance",
            "items": [
                "Check robots.txt for all AI user-agents before crawling",
                "Check HTTP X-Robots-Tag headers during crawl",
                "Check HTML meta tags for noai directives",
                "Check TDM reservation metadata (EU requirement)",
                "Re-check opt-out signals monthly (sites update robots.txt)",
                "Maintain a blocklist of opted-out domains",
            ],
        },
        {
            "category": "Provenance Tracking",
            "items": [
                "Record source URL for every training document",
                "Record crawl date and time",
                "Record detected license (CC, public domain, etc.)",
                "Record opt-out signal status at crawl time",
                "Map documents through dedup/filter/tokenization pipeline",
                "Enable reverse lookup: training token -> source URL",
            ],
        },
        {
            "category": "Takedown Mechanism",
            "items": [
                "Accept and process takedown requests within 72 hours",
                "Remove specified URLs from training data storage",
                "Add domains to blocklist for future crawls",
                "Document removal for compliance audit",
                "Evaluate whether model retraining is needed",
            ],
        },
        {
            "category": "EU AI Act Compliance",
            "items": [
                "Prepare training data summary (Article 53 template)",
                "Document copyright policy and fair use analysis",
                "Implement TDM reservation checking",
                "GDPR: PII scrubbing and right-to-erasure mechanism",
                "Appoint an EU authorized representative if outside EU",
            ],
        },
        {
            "category": "PII and Privacy",
            "items": [
                "Multi-stage PII scrubbing (regex + NER)",
                "GDPR right-to-erasure mechanism",
                "Data minimization: do not retain unnecessary personal data",
                "Data protection impact assessment (DPIA)",
                "Cross-border transfer safeguards (SCCs or adequacy)",
            ],
        },
    ]

    def audit(self, pipeline_state):
        """
        Run compliance audit against checklist.
        Returns pass/fail for each item.
        """
        results = []

        for category in self.CHECKLIST:
            category_results = {
                "category": category["category"],
                "items": [],
                "pass_rate": 0.0,
            }

            passed = 0
            for item in category["items"]:
                is_implemented = pipeline_state.get(
                    item, False
                )
                category_results["items"].append({
                    "item": item,
                    "status": (
                        "PASS" if is_implemented else "FAIL"
                    ),
                })
                if is_implemented:
                    passed += 1

            category_results["pass_rate"] = (
                passed / len(category["items"])
            )
            results.append(category_results)

        return results
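Audit results are only useful for a compliance record if each run is archived. A small sketch of persisting them as a dated JSON artifact (the directory layout and field names are assumptions, not part of the checklist above):

```python
import datetime
import json
import pathlib
import tempfile

def write_audit_artifact(results, out_dir):
    """Persist one audit run as a timestamped JSON file."""
    path = pathlib.Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%dT%H%M%S")
    out = path / f"audit_{stamp}.json"
    out.write_text(json.dumps({"run_at": stamp, "results": results}, indent=2))
    return out

# Example: archive the output of ComplianceChecklist().audit(...).
artifact = write_audit_artifact(
    [{"category": "Opt-Out Compliance", "pass_rate": 0.5}],
    out_dir=tempfile.mkdtemp(),
)
print(artifact.name)
```

Keeping one artifact per pipeline run gives auditors a paper trail showing when each control was in place, which matters for the "Document removal for compliance audit" item above.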

Key Takeaways

Copyright law and AI training exist in a state of legal uncertainty. No jurisdiction has fully resolved whether training on copyrighted data is permissible. Engineering teams must build for compliance under multiple possible future outcomes.

The critical decisions:

  1. Opt-out signals must be checked and honored: 35% of top websites block GPTBot via robots.txt. Ignoring opt-out signals creates legal risk and reputational damage. Check robots.txt, HTTP headers, HTML meta tags, and TDM reservations before including any content.

  2. Provenance records are legally required in the EU: The EU AI Act (Article 53) requires GPAI providers to publish a training data summary. Even outside the EU, provenance enables takedown compliance and fair use defense (you can demonstrate what was and was not in training data).

  3. Takedown capability is a hard requirement: Courts may order removal of specific copyrighted content. Without the ability to trace training tokens back to source URLs and remove them, compliance is impossible. Build this capability from day one.

  4. Fair use is not a blanket defense: Each source type has a different fair use profile. Factual content (news, Wikipedia, papers) has a stronger fair use argument than creative content (fiction, poetry, music). But even factual content can fail on Factor 4 (market substitution) if the model’s output directly competes with the original.

  5. Japan is the safest jurisdiction; EU is the most restrictive: Japan’s Article 30-4 explicitly permits AI training. The EU requires opt-out compliance (DSM Directive), training data disclosure (AI Act), and PII handling (GDPR). US law is uncertain pending litigation. Build for the most restrictive jurisdiction your model will serve.
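The opt-out check in point 1 can be done for robots.txt entirely with the Python standard library; a sketch using urllib.robotparser against a hypothetical robots.txt (the user-agent list is illustrative, not exhaustive, and a real pipeline would also check HTTP headers, meta tags, and TDM reservations):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt blocking two common AI crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

AI_USER_AGENTS = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended"]

def blocked_ai_agents(robots_txt, url):
    """Return the AI user-agents this robots.txt blocks for the URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [ua for ua in AI_USER_AGENTS if not rp.can_fetch(ua, url)]

print(blocked_ai_agents(ROBOTS_TXT, "https://example.com/article"))
# → ['GPTBot', 'CCBot']
```

A crawler would skip the URL (and record the opt-out signal in provenance) whenever this list is non-empty, and re-run the check periodically since sites update robots.txt over time.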