Safety guardrails add 12-45 ms of latency to every request. RLHF bakes refusal behavior into model weights but can be jailbroken with adversarial prompts. System prompts catch obvious attacks but add token overhead to every turn of a long conversation. Output classifiers run asynchronously but miss implicit harms. Every layer costs something — latency, quality degradation, or bypass risk — and frontier labs must choose which trade-offs they accept. No safety system is perfect; the question is which residual few percent of attacks you are willing to let through.
Training-Time Alignment: RLHF Pipeline
Reinforcement Learning from Human Feedback (RLHF) is the primary mechanism for encoding safety constraints into model weights.
def rlhf_pipeline_stages() -> dict:
    """
    Standard RLHF pipeline for safety alignment.

    Stage 1: Supervised Fine-Tuning (SFT)
    Stage 2: Reward Model (RM) Training
    Stage 3: PPO/DPO Optimization
    """
    pipeline = {
        "stage_1_sft": {
            "input": "Human-written safe responses to diverse prompts",
            "output": "SFT model that follows instructions",
            "data_size": "50K-500K examples",
            "training_cost": "~5% of pretraining compute",
            "safety_role": "Establishes response format and basic compliance",
        },
        "stage_2_reward_model": {
            "input": "Pairs of responses ranked by human annotators",
            "output": "Reward model R(prompt, response) -> scalar",
            "data_size": "100K-1M comparison pairs",
            "annotation_cost": "$2-10 per comparison",
            "safety_role": "Learns to distinguish safe from unsafe outputs",
            "architecture": "Same as base model + scalar head (1 output)",
        },
        "stage_3_ppo": {
            "input": "Prompts + SFT model + Reward model",
            "output": "Aligned model",
            "objective": "max E[R(x, y)] - beta * KL(pi || pi_sft)",
            "beta": "0.01-0.1 (KL penalty coefficient)",
            "safety_role": "Optimizes model to produce high-reward (safe) outputs",
            "compute_cost": "~10-20% of pretraining compute",
        },
    }
    return pipeline
import math


class RewardModel:
    """Simplified reward model for safety scoring."""

    def __init__(self, base_model, hidden_dim: int):
        self.base_model = base_model
        # Single linear layer: hidden_dim -> 1
        self.reward_head_weight = [0.0] * hidden_dim
        self.reward_head_bias = 0.0

    def forward(self, input_ids: list) -> float:
        """
        Compute reward score for a (prompt, response) pair.
        Higher score = safer, more helpful response.
        Returns scalar reward.
        """
        # Get last hidden state from base model
        hidden_states = self.base_model.forward(input_ids)
        last_hidden = hidden_states[-1]  # [hidden_dim]
        # Linear projection to scalar
        reward = sum(w * h for w, h in zip(self.reward_head_weight, last_hidden))
        reward += self.reward_head_bias
        return reward

    def compute_loss(self, chosen_reward: float, rejected_reward: float) -> float:
        """
        Bradley-Terry loss: chosen response should score higher.
        L = -log(sigmoid(r_chosen - r_rejected))
        """
        diff = chosen_reward - rejected_reward
        # Stable softplus form of -log(sigmoid(diff)); avoids overflow
        # in exp() when diff is a large negative number
        return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
The reward model is typically the same size as the policy model. Llama 2’s 70B chat model used a 70B reward model. This means RLHF training requires 4 models in memory simultaneously: policy, reference policy, reward model, and value model. For a 70B model, this requires roughly 560GB of GPU memory in FP16.
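The memory arithmetic is worth making explicit. A quick budget check, as a sketch that counts FP16 weights only (optimizer state, activations, and KV caches add substantially more in practice):

```python
def rlhf_memory_budget_gb(params_billion: float, bytes_per_param: int = 2) -> dict:
    """Weight memory needed during preference training (FP16 by default).

    Counts model weights only; optimizer state, activations, and KV
    caches are deliberately ignored in this rough sketch.
    """
    one_model_gb = params_billion * 1e9 * bytes_per_param / 1e9
    return {
        "per_model_gb": one_model_gb,
        "ppo_4_models_gb": 4 * one_model_gb,  # policy, reference, reward, value
        "dpo_2_models_gb": 2 * one_model_gb,  # policy, reference only
    }

budget = rlhf_memory_budget_gb(70)
# 70B in FP16 is 140 GB per model: 560 GB for PPO, 280 GB for DPO
```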
Direct Preference Optimization (DPO)
DPO eliminates the reward model entirely, reducing the safety training pipeline from 3 stages to 2.
import math


def dpo_loss(
    pi_logprob_chosen: float,
    pi_logprob_rejected: float,
    ref_logprob_chosen: float,
    ref_logprob_rejected: float,
    beta: float = 0.1,
) -> float:
    """
    DPO loss function.

    L_DPO = -log sigmoid(beta * (log(pi(y_w|x)/ref(y_w|x)) - log(pi(y_l|x)/ref(y_l|x))))

    pi: policy model log probabilities
    ref: reference (SFT) model log probabilities
    y_w: chosen (winning) response
    y_l: rejected (losing) response
    beta: temperature parameter
    """
    # Log ratios
    chosen_ratio = pi_logprob_chosen - ref_logprob_chosen
    rejected_ratio = pi_logprob_rejected - ref_logprob_rejected
    # DPO implicit reward difference
    reward_diff = beta * (chosen_ratio - rejected_ratio)
    # Binary cross-entropy
    loss = -math.log(1 / (1 + math.exp(-reward_diff)))
    return loss
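A quick sanity check of the loss behavior, using hypothetical log-probabilities and a standalone restatement of the formula above:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # -log sigmoid(beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)))
    diff = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return math.log1p(math.exp(-diff))  # equal to -log(sigmoid(diff))

# Policy shifted toward the chosen response (relative to the reference): small loss
good = dpo_loss(pi_chosen=-10.0, pi_rejected=-40.0,
                ref_chosen=-20.0, ref_rejected=-20.0)
# Policy shifted toward the rejected response: large loss
bad = dpo_loss(pi_chosen=-40.0, pi_rejected=-10.0,
               ref_chosen=-20.0, ref_rejected=-20.0)
assert good < bad
```

Minimizing the loss therefore pushes probability mass toward chosen responses and away from rejected ones, relative to the frozen reference model.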
def dpo_vs_rlhf_comparison() -> dict:
    """Compare DPO and RLHF for safety alignment."""
    return {
        "dpo_advantages": [
            "No reward model needed (saves 50% memory)",
            "No PPO instability (value function estimation)",
            "Simpler hyperparameter tuning (just beta)",
            "2x faster training (no RM forward pass)",
        ],
        "dpo_disadvantages": [
            "Cannot iteratively improve (offline algorithm)",
            "Sensitive to data distribution mismatch",
            "May overfit to preference dataset",
            "Harder to scale safety categories independently",
        ],
        "rlhf_advantages": [
            "Online algorithm: generates new data during training",
            "Can balance multiple reward signals (safety, helpfulness)",
            "More robust to distribution shift",
            "Proven at scale (GPT-4, Claude)",
        ],
        "rlhf_disadvantages": [
            "4 models in memory simultaneously",
            "PPO is unstable, sensitive to hyperparameters",
            "Reward hacking: model exploits reward model weaknesses",
            "Expensive: 10-20% of pretraining compute",
        ],
    }
DPO vs RLHF Training Comparison (70B Model)
| Metric | RLHF (PPO) | DPO | Difference |
|---|---|---|---|
| GPU Memory (FP16) | 560 GB (4 models) | 280 GB (2 models) | -50% |
| Training Time | ~200 GPU-hours | ~100 GPU-hours | -50% |
| Hyperparameters | ~15 (PPO + RM) | ~5 (beta + LR) | -67% |
| Safety Robustness | High (online) | Medium (offline) | RLHF wins |
| Multi-Objective | Easy (reward mixing) | Hard (separate runs) | RLHF wins |
Constitutional AI: Self-Supervised Safety
Constitutional AI (CAI) uses the model itself to generate and evaluate safety training data.
def constitutional_ai_pipeline(
    base_model,
    constitution: list,
    num_prompts: int = 10000,
) -> dict:
    """
    Constitutional AI pipeline:
    1. Generate responses to red-team prompts
    2. Critique responses using constitutional principles
    3. Revise responses based on critique
    4. Train on (prompt, revised_response) pairs
    """
    # The constitution: a list of principles
    example_constitution = [
        "Choose the response that is least harmful or toxic",
        "Choose the response that is most helpful while being safe",
        "Choose the response that does not encourage illegal activity",
        "Choose the response that avoids stereotypes and bias",
        "Choose the response that is honest about its limitations",
    ]

    # Stage 1: Red-team generation
    red_team_template = """Human: {harmful_prompt}
Assistant: {initial_response}
Critique: Identify any harmful content in the assistant's response.
Revision: Rewrite the response to be helpful while avoiding harm."""

    # Stage 2: AI feedback (RLAIF)
    # Instead of human preferences, use the model + constitution
    comparison_template = """Consider the following two responses:
Response A: {response_a}
Response B: {response_b}
Based on the principle "{principle}", which response is better?"""

    pipeline_stats = {
        "red_team_prompts": num_prompts,
        "critiques_generated": num_prompts,
        "revisions_generated": num_prompts,
        "comparison_pairs": num_prompts * len(constitution),
        "human_annotation_needed": 0,  # Key advantage
        "compute_cost": "~5% of pretraining compute",
    }
    return pipeline_stats
Constitutional AI’s main advantage is cost: it generates safety training data without human annotators. Anthropic reports that CAI produces safety alignment comparable to RLHF at roughly 10x lower annotation cost. The trade-off is that the model’s safety is limited by the model’s own ability to evaluate harm.
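The critique-revision loop at the heart of CAI can be sketched as follows. `model_generate` is a hypothetical stand-in for the base model's completion API, and the prompt wording is illustrative, not Anthropic's actual templates:

```python
def critique_and_revise(model_generate, prompt, initial_response, principle):
    """One critique-revision step (sketch; prompt wording is illustrative).

    model_generate(text) -> str is assumed to wrap the base model.
    The resulting (prompt, revision) pair becomes SFT training data
    with no human annotation.
    """
    critique = model_generate(
        f"Human: {prompt}\nAssistant: {initial_response}\n"
        f"Critique the response against this principle: {principle}"
    )
    revision = model_generate(
        f"Human: {prompt}\nAssistant: {initial_response}\n"
        f"Critique: {critique}\n"
        f"Rewrite the response to satisfy: {principle}"
    )
    return {"critique": critique, "revision": revision}
```

Because the same model both critiques and revises, the loop's quality ceiling is the model's own judgment of harm, which is exactly the trade-off noted above.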
System Prompt Enforcement
At inference time, the system prompt is the first line of defense. How it is implemented affects both safety and performance.
def system_prompt_implementation() -> dict:
    """
    System prompt enforcement mechanisms and their robustness.
    """
    mechanisms = {
        "prefix_injection": {
            "method": "Prepend system prompt to every conversation",
            "implementation": "Concatenate system tokens before user tokens",
            "kv_cache_impact": "System prompt KV is cached and reused",
            "bypass_difficulty": "Low — prompt injection can override",
            "latency_overhead_ms": 0,  # After initial prefill
            "code": """
def apply_system_prompt(system: str, user: str) -> str:
    # Llama 3 format
    formatted = (
        f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\\n\\n"
        f"{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\\n\\n"
        f"{user}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\\n\\n"
    )
    return formatted
""",
        },
        "attention_masking": {
            "method": "System prompt attends to all, user cannot modify system",
            "implementation": "Custom attention mask: system tokens are read-only",
            "bypass_difficulty": "Medium — requires architectural support",
            "latency_overhead_ms": 0.5,  # Mask computation
        },
        "hierarchical_instruction": {
            "method": "System prompt given higher priority via training",
            "implementation": "During RLHF, train model to prioritize system over user",
            "bypass_difficulty": "High — embedded in weights",
            "latency_overhead_ms": 0,
        },
    }
    return mechanisms
def system_prompt_kv_cache_optimization(
    system_prompt_tokens: int,
    num_concurrent_requests: int,
    kv_bytes_per_token: int,
) -> dict:
    """
    System prompt KV cache can be shared across requests
    with the same system prompt (prefix caching).
    """
    per_request_kv = system_prompt_tokens * kv_bytes_per_token
    naive_total = per_request_kv * num_concurrent_requests
    shared_total = per_request_kv  # Only one copy needed
    return {
        "naive_memory_gb": naive_total / 1e9,
        "shared_memory_gb": shared_total / 1e9,
        "memory_savings_gb": (naive_total - shared_total) / 1e9,
        "savings_pct": (1 - shared_total / naive_total) * 100,
    }


# Example: 500-token system prompt, 256 concurrent requests, Llama 3 70B.
# KV bytes per token = 80 layers * 8 KV heads * 128 head_dim
#                      * 2 (K and V) * 2 bytes (FP16) = 327,680
result = system_prompt_kv_cache_optimization(500, 256, 327_680)
System Prompt Enforcement Mechanisms
| Mechanism | Bypass Difficulty | Latency Overhead | Implementation Complexity |
|---|---|---|---|
| Prefix Injection | Low | 0 ms | Simple |
| Attention Masking | Medium | 0.5 ms | Moderate |
| Hierarchical Training | High | 0 ms | Complex (RLHF) |
| Prefix Caching + Injection | Low | 0 ms (cached) | Moderate |
Output Classifiers: Post-Generation Filtering
Output classifiers run after generation to detect and filter unsafe content.
class SafetyClassifier:
    """
    Lightweight classifier that runs on model output.
    Typically a small transformer (100M-1B params) fine-tuned
    for multi-label safety classification.
    """

    CATEGORIES = [
        "hate_speech",
        "violence",
        "sexual_content",
        "self_harm",
        "illegal_activity",
        "personal_information",
        "misinformation",
        "copyright_violation",
    ]

    def __init__(self, model_size_params: int, hidden_dim: int):
        self.model_size = model_size_params
        self.hidden_dim = hidden_dim
        self.num_categories = len(self.CATEGORIES)
        # Multi-label classification head
        self.classifier_params = hidden_dim * self.num_categories

    def classify(self, text_tokens: list) -> dict:
        """
        Run safety classification on generated text.
        Returns per-category probability.
        """
        # Forward pass through small transformer
        # hidden = self.model.forward(text_tokens)[-1]
        # logits = self.classifier_head(hidden)  # [num_categories]
        # probs = sigmoid(logits)
        # Simulated output
        return {cat: 0.01 for cat in self.CATEGORIES}  # placeholder probabilities

    def should_block(self, probs: dict, thresholds: dict) -> tuple:
        """
        Determine if output should be blocked.
        Returns (should_block, triggered_categories).
        """
        triggered = []
        for category, prob in probs.items():
            threshold = thresholds.get(category, 0.5)
            if prob > threshold:
                triggered.append((category, prob))
        return len(triggered) > 0, triggered

    def latency_estimate(self, num_tokens: int) -> dict:
        """
        Estimate classification latency.
        Small classifiers (100M-300M) on a single GPU.
        """
        # Rough estimate: 0.5ms per 100 tokens for a 300M classifier
        latency_ms = (num_tokens / 100) * 0.5
        return {
            "classification_latency_ms": latency_ms,
            "model_size_gb": self.model_size * 2 / 1e9,  # FP16
            "gpu_memory_overhead_gb": self.model_size * 2 / 1e9,
            "throughput_classifications_per_sec": (
                1000 / latency_ms if latency_ms > 0 else float("inf")
            ),
        }


# Llama Guard: Meta's official safety classifier
llama_guard_config = {
    "model": "Llama Guard 3 8B",
    "base": "Llama 3.1 8B",
    "categories": 13,  # S1-S13 taxonomy
    "input_format": "Conversation (multi-turn)",
    "output_format": "safe/unsafe + category label",
    "latency_per_classification": "15-30ms (8B model on A100)",
    "memory_overhead_gb": 16,  # FP16
}
Safety Classifier Latency vs Model Size
Output classifiers add latency to every request. For streaming responses, classification must run on partial outputs, which reduces accuracy. Llama Guard 3 (8B) adds 15-30ms per classification on an A100, which is acceptable for non-streaming but noticeable for streaming with <50ms TPOT targets.
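One common workaround for streaming is to classify the accumulated output every N tokens instead of once at the end, trading some accuracy for bounded added latency. A minimal sketch, where `classify` is a hypothetical callback returning an unsafe-probability for the text so far:

```python
def streaming_safety_check(token_stream, classify, chunk_size=32, threshold=0.5):
    """Classify the growing response every `chunk_size` tokens.

    classify(text) -> float is an assumed interface returning the
    probability that the text so far is unsafe.
    Returns (emitted_tokens, blocked).
    """
    emitted = []
    for i, token in enumerate(token_stream, start=1):
        emitted.append(token)
        # Periodic check: halt the stream mid-response if unsafe
        if i % chunk_size == 0 and classify(" ".join(emitted)) > threshold:
            return emitted, True
    # Final check on the complete output
    return emitted, classify(" ".join(emitted)) > threshold
```

Smaller chunk sizes catch unsafe content sooner but multiply classifier invocations, so the chunk size is effectively a knob between the streaming and non-streaming latency profiles described above.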
Input Filtering: Pre-Generation Defense
Input filters detect and block harmful prompts before generation begins.
class InputFilter:
    """
    Multi-stage input filtering pipeline.
    Runs BEFORE the main model generates any tokens.

    The stage methods below are illustrative stubs that always pass;
    a real implementation would return passed=False on a match.
    """

    def __init__(self):
        self.stages = [
            self.keyword_filter,
            self.pattern_filter,
            self.embedding_classifier,
            self.llm_judge,
        ]

    def keyword_filter(self, text: str) -> dict:
        """
        Stage 1: Fast keyword/regex matching.
        Latency: less than 0.1ms.
        False positive rate: high (5-10%).
        """
        # Blocklist of known harmful patterns
        # This is the fastest but least accurate stage
        return {"passed": True, "latency_ms": 0.05, "stage": "keyword"}

    def pattern_filter(self, text: str) -> dict:
        """
        Stage 2: Regex pattern matching for injection attacks.
        Detects common prompt injection patterns.
        Latency: less than 0.5ms.
        """
        injection_patterns = [  # illustrative; matching logic omitted
            r"ignore previous instructions",
            r"you are now",
            r"pretend you are",
            r"jailbreak",
            r"DAN mode",
            r"\[system\].*\[/system\]",  # Fake system prompts
        ]
        return {"passed": True, "latency_ms": 0.3, "stage": "pattern"}

    def embedding_classifier(self, text: str) -> dict:
        """
        Stage 3: Embedding-based classifier.
        Small model (50M-100M) maps input to safety score.
        Latency: 1-3ms.
        False positive rate: 1-2%.
        """
        return {"passed": True, "latency_ms": 2.0, "stage": "embedding"}

    def llm_judge(self, text: str) -> dict:
        """
        Stage 4: Use a small LLM to judge input safety.
        Most accurate but most expensive.
        Latency: 10-50ms.
        Only triggered if earlier stages flag uncertainty.
        """
        return {"passed": True, "latency_ms": 30.0, "stage": "llm_judge"}

    def filter_pipeline(self, text: str) -> dict:
        """
        Run stages in order. Early exit on clear pass/fail.
        Only escalate to expensive stages on uncertainty.
        """
        total_latency = 0.0
        for stage_fn in self.stages:
            result = stage_fn(text)
            total_latency += result["latency_ms"]
            if not result["passed"]:
                return {"blocked": True, "stage": result["stage"],
                        "total_latency_ms": total_latency}
        return {"blocked": False, "total_latency_ms": total_latency}
Input Filter Pipeline Latency
| Stage | Latency | Accuracy | When Used |
|---|---|---|---|
| Keyword/Regex | 0.1 ms | Low (high FP) | Always |
| Pattern Matching | 0.3 ms | Medium | Always |
| Embedding Classifier | 2 ms | High | Always |
| LLM Judge | 30 ms | Highest | On uncertainty only |
| Full Pipeline (typical) | 2.4 ms | High | Combined |
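Because the LLM judge fires only on uncertainty, the expected pipeline latency is a probability-weighted sum of stage latencies. A sketch using the table's figures and an assumed (not measured) 5% escalation rate:

```python
def expected_filter_latency_ms(escalation_rate: float = 0.05) -> float:
    """Expected input-filter latency when the LLM judge runs on only
    a fraction of requests. escalation_rate is an assumed figure.
    """
    always_on = 0.1 + 0.3 + 2.0  # keyword + pattern + embedding, in ms
    llm_judge = 30.0             # fires on escalation only
    return always_on + escalation_rate * llm_judge

# At 5% escalation the judge adds only 1.5 ms on average,
# giving roughly the 2.4 ms "typical" figure plus judge overhead
print(expected_filter_latency_ms())
```

This is why the expensive stage can exist at all: its 30 ms cost is amortized across the 95% of requests that never reach it.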
Serving-Layer Safety Architecture
The serving layer coordinates all safety components and manages the overall safety pipeline.
class SafetyServingLayer:
    """
    Orchestrates input filtering, model generation,
    and output classification in a serving pipeline.
    """

    def __init__(self, config: dict):
        self.input_filter = InputFilter()
        self.output_classifier = SafetyClassifier(
            model_size_params=config.get("classifier_size", 300_000_000),
            hidden_dim=config.get("classifier_hidden", 1024),
        )
        self.thresholds = config.get("safety_thresholds", {
            "hate_speech": 0.3,
            "violence": 0.5,
            "sexual_content": 0.3,
            "self_harm": 0.2,
            "illegal_activity": 0.3,
        })

    def serve_request(self, prompt: str, system_prompt: str) -> dict:
        """
        Full safety pipeline for one request.
        """
        metrics = {"stages": []}

        # Stage 1: Input filtering
        input_result = self.input_filter.filter_pipeline(prompt)
        metrics["stages"].append({
            "name": "input_filter",
            "latency_ms": input_result["total_latency_ms"],
        })
        if input_result["blocked"]:
            return {
                "response": "I cannot help with that request.",
                "blocked": True,
                "blocked_stage": "input",
                "metrics": metrics,
            }

        # Stage 2: Model generation (main LLM)
        # formatted = apply_system_prompt(system_prompt, prompt)
        # response_tokens = model.generate(formatted)
        generation_latency_ms = 500  # placeholder
        metrics["stages"].append({
            "name": "generation",
            "latency_ms": generation_latency_ms,
        })

        # Stage 3: Output classification
        response_text = "Generated response placeholder"
        probs = self.output_classifier.classify([])
        should_block, triggered = self.output_classifier.should_block(
            probs, self.thresholds
        )
        metrics["stages"].append({
            "name": "output_classifier",
            "latency_ms": 5.0,
        })
        if should_block:
            return {
                "response": "I cannot provide that information.",
                "blocked": True,
                "blocked_stage": "output",
                "triggered_categories": [t[0] for t in triggered],
                "metrics": metrics,
            }

        total_latency = sum(s["latency_ms"] for s in metrics["stages"])
        metrics["total_latency_ms"] = total_latency
        metrics["safety_overhead_ms"] = total_latency - generation_latency_ms
        metrics["safety_overhead_pct"] = (
            (total_latency - generation_latency_ms) / total_latency * 100
        )
        return {
            "response": response_text,
            "blocked": False,
            "metrics": metrics,
        }
Safety Overhead as % of Total Latency
Robustness: Jailbreak Attack Taxonomy
Understanding attack vectors is essential for designing robust safety architectures.
def jailbreak_taxonomy() -> dict:
    """
    Classification of known jailbreak techniques
    and which safety layers defend against each.
    """
    attacks = {
        "direct_request": {
            "description": "Directly asking for harmful content",
            "difficulty": "Easy to detect",
            "defended_by": ["RLHF alignment", "Input keyword filter"],
            "success_rate_aligned_model": "less than 1%",
        },
        "role_playing": {
            "description": "Ask model to play a character without restrictions",
            "examples": ["DAN (Do Anything Now)", "Grandma bedtime story"],
            "defended_by": ["RLHF with role-play adversarial data", "Output classifier"],
            "success_rate_aligned_model": "5-15%",
        },
        "prompt_injection": {
            "description": "Inject instructions that override system prompt",
            "examples": ["Ignore previous instructions", "New system prompt:"],
            "defended_by": ["Pattern filter", "Hierarchical instruction training"],
            "success_rate_aligned_model": "5-20%",
        },
        "encoding_obfuscation": {
            "description": "Encode harmful request in base64, ROT13, pig latin",
            "defended_by": ["Input normalization", "Multi-encoding input filter"],
            "success_rate_aligned_model": "10-30%",
        },
        "multi_turn_escalation": {
            "description": "Gradually escalate across conversation turns",
            "defended_by": ["Conversation-level classifier", "Turn-by-turn output filter"],
            "success_rate_aligned_model": "15-30%",
        },
        "adversarial_suffix": {
            "description": "Append optimized token sequences (GCG attack)",
            "defended_by": ["Perplexity filter", "Input anomaly detection"],
            "success_rate_aligned_model": "20-60% (without defenses)",
        },
    }
    return attacks
Jailbreak Success Rates vs Defense Layers
| Attack Type | No Safety | RLHF Only | RLHF + Classifier | Full Pipeline |
|---|---|---|---|---|
| Direct Request | 100% | 1% | 0.1% | 0.01% |
| Role Playing | 100% | 15% | 5% | 2% |
| Prompt Injection | 100% | 20% | 8% | 3% |
| Encoding Obfuscation | 100% | 30% | 10% | 3% |
| Multi-Turn Escalation | 100% | 30% | 12% | 5% |
| Adversarial Suffix (GCG) | 100% | 60% | 15% | 5% |
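One of the GCG defenses listed above, the perplexity filter, exploits the fact that optimized adversarial suffixes look like gibberish to a language model. A sketch, with `logprob_fn` as a hypothetical per-token log-probability callback backed by a small reference LM; the `max_ppl` threshold is an illustrative value, not a recommended setting:

```python
import math

def perplexity_filter(tokens, logprob_fn, max_ppl=1000.0):
    """Flag inputs whose perplexity under a small reference LM is anomalous.

    logprob_fn(tokens) -> list of per-token log-probs (assumed interface).
    GCG-optimized suffixes are gibberish to an LM, so their perplexity
    spikes far above that of natural text.
    """
    logprobs = logprob_fn(tokens)
    avg_nll = -sum(logprobs) / len(logprobs)  # mean negative log-likelihood
    ppl = math.exp(avg_nll)
    return {"perplexity": ppl, "blocked": ppl > max_ppl}
```

In practice the threshold must be calibrated per reference model, and windowed perplexity (over suffix-sized spans) catches suffixes appended to otherwise fluent prompts.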
Performance Cost of Safety
Every safety layer adds latency and compute cost. Quantifying this trade-off is essential for production deployment.
def safety_performance_budget(
    model_latency_ms: float,
    input_filter_ms: float = 2.5,
    output_classifier_ms: float = 5.0,
    llm_guard_ms: float = 25.0,
    use_llm_guard: bool = False,
) -> dict:
    """
    Calculate total safety overhead.
    """
    safety_latency = input_filter_ms + output_classifier_ms
    if use_llm_guard:
        safety_latency += llm_guard_ms
    total_latency = model_latency_ms + safety_latency
    overhead_pct = (safety_latency / total_latency) * 100

    # GPU cost: classifier needs its own GPU (or shares with model)
    # 300M classifier on shared GPU: ~0.6 GB overhead
    # 8B Llama Guard on dedicated GPU: 16 GB dedicated
    gpu_overhead = {
        "shared_classifier_gb": 0.6,
        "llama_guard_dedicated_gb": 16.0 if use_llm_guard else 0,
    }

    # Throughput impact: safety pipeline can be parallelized
    # Input filter runs during previous request's generation
    # Output classifier can overlap with next request's prefill
    pipelined_overhead_ms = max(input_filter_ms, output_classifier_ms) * 0.3

    return {
        "safety_latency_ms": safety_latency,
        "total_latency_ms": total_latency,
        "overhead_pct": overhead_pct,
        "pipelined_overhead_ms": pipelined_overhead_ms,
        "gpu_overhead": gpu_overhead,
    }


# Typical production scenario
for model_latency in [50, 200, 1000, 5000]:
    result = safety_performance_budget(model_latency)
    print(f"Model={model_latency}ms: safety={result['safety_latency_ms']:.1f}ms "
          f"({result['overhead_pct']:.1f}% overhead)")
Safety Overhead Relative to Model Latency
Safety overhead is inversely proportional to generation time. For short responses (chatbot-style, <100 tokens), safety adds 5-15% latency. For long responses (code generation, analysis), safety overhead is negligible (<1%). Design your safety pipeline with your typical response length in mind.
Multi-Model Safety Architectures
Production systems often use multiple models in the safety pipeline.
def multi_model_safety_architecture() -> dict:
    """
    Production safety architecture with specialized models.
    """
    architecture = {
        "input_classifier": {
            "model": "Custom 300M safety classifier",
            "gpu": "Shared with main model (0.6 GB)",
            "role": "Fast input screening",
            "latency": "2ms",
        },
        "main_model": {
            "model": "Llama 3.1 70B (RLHF-aligned)",
            "gpu": "8x A100-80GB (TP=8)",
            "role": "Generate response",
            "latency": "200-2000ms",
        },
        "output_classifier": {
            "model": "Llama Guard 3 8B",
            "gpu": "1x A100-80GB (dedicated)",
            "role": "Classify output safety",
            "latency": "15-30ms",
        },
        "fallback_model": {
            "model": "Llama 3.1 8B (extra-safe fine-tune)",
            "gpu": "Shared with output classifier",
            "role": "Generate safe fallback if main output blocked",
            "latency": "50-200ms",
        },
        "total_gpu_cost": "9x A100-80GB",
        "safety_gpu_fraction": "1/9 = 11%",
    }
    return architecture
Multi-Model Safety Pipeline Resource Usage
| Component | Model Size | GPU Allocation | Latency |
|---|---|---|---|
| Input Classifier | 300M | Shared (0.6 GB) | 2 ms |
| Main Model (70B) | 70B | 8x A100 TP=8 | 200-2000 ms |
| Llama Guard 3 | 8B | 1x A100 dedicated | 15-30 ms |
| Fallback Model | 8B | Shared with Guard | 50-200 ms |
| Total Pipeline | - | 9x A100-80GB | 220-2050 ms |
The safety architecture of frontier models is a multi-layered system where each layer trades off latency for accuracy. Training-time alignment (RLHF/DPO) provides the foundation with zero inference overhead. Inference-time classifiers add 5-30ms but catch edge cases that alignment misses. Serving-layer filters add another 2-3ms for known attack patterns. The total safety overhead for a well-designed pipeline is 1-15% of total latency, depending on response length. The engineering challenge is not implementing any single layer, but orchestrating all layers to minimize latency while maximizing coverage across the attack surface.