Part 18 of 30 in the series: NVIDIA Dynamo & llm-d

Running multiple tenants on the same GPU cluster saves 40-60% on infrastructure costs compared to dedicated per-tenant deployments. A Llama 70B model loaded once on 4xH100 can serve requests from 50 different customers simultaneously. But shared infrastructure creates security risks: one tenant’s prompt might leak into another tenant’s KV cache. One tenant’s traffic spike might degrade another tenant’s latency. One tenant’s request logs might be visible to another tenant’s administrator.

Dynamo addresses multi-tenancy through four layers of isolation: software-level request isolation (separate KV caches per tenant), hardware-level GPU partitioning (NVIDIA MIG), network-level segmentation (per-tenant VLANs and mTLS), and audit-level compliance (per-request logging with tenant attribution). This post covers the implementation of each layer.
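As a roadmap for the rest of the post, the four layers can be written down as plain data. This is an illustrative sketch only; the `IsolationLayer` type and names are mine, not a Dynamo API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IsolationLayer:
    name: str
    mechanism: str
    enforced_by: str

# Illustrative summary of the four isolation layers covered in this post
ISOLATION_LAYERS = [
    IsolationLayer("request", "per-tenant KV cache namespacing", "software"),
    IsolationLayer("hardware", "NVIDIA MIG GPU partitioning", "hardware"),
    IsolationLayer("network", "per-tenant VLANs + mTLS", "network"),
    IsolationLayer("audit", "per-request logging with tenant attribution", "process"),
]
```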

The Multi-Tenancy Threat Model

What Can Go Wrong

from dataclasses import dataclass
from enum import Enum

class ThreatCategory(Enum):
    KV_CACHE_LEAKAGE = "kv_cache_leakage"
    PROMPT_INJECTION = "prompt_injection"
    RESOURCE_STARVATION = "resource_starvation"
    DATA_EXFILTRATION = "data_exfiltration"
    SIDE_CHANNEL = "side_channel"

@dataclass
class ThreatScenario:
    category: ThreatCategory
    description: str
    severity: str  # "critical", "high", "medium", "low"
    mitigation: str

THREAT_MODEL = [
    ThreatScenario(
        category=ThreatCategory.KV_CACHE_LEAKAGE,
        description=(
            "Tenant A's KV cache entries are read by Tenant B's decode step. "
            "This could expose Tenant A's prompt content to Tenant B. "
            "Most likely cause: bug in KV cache indexing that maps to wrong tenant."
        ),
        severity="critical",
        mitigation="Per-tenant KV cache namespacing + memory access guards",
    ),
    ThreatScenario(
        category=ThreatCategory.PROMPT_INJECTION,
        description=(
            "A malicious prompt causes the model to reveal information "
            "from other tenants' system prompts stored in KV cache. "
            "Especially dangerous with prefix caching where system prompts "
            "might be shared across tenants."
        ),
        severity="critical",
        mitigation="Never share KV cache across tenants, even for identical prefixes",
    ),
    ThreatScenario(
        category=ThreatCategory.RESOURCE_STARVATION,
        description=(
            "Tenant A sends a burst of long-context requests that consume "
            "all GPU memory and batch slots, causing Tenant B's requests "
            "to timeout or be rejected."
        ),
        severity="high",
        mitigation="Per-tenant rate limits + resource quotas + fair scheduling",
    ),
    ThreatScenario(
        category=ThreatCategory.DATA_EXFILTRATION,
        description=(
            "Model outputs from Tenant A's requests are logged in a shared "
            "logging system accessible to Tenant B's administrators."
        ),
        severity="high",
        mitigation="Per-tenant log partitioning + encryption at rest",
    ),
    ThreatScenario(
        category=ThreatCategory.SIDE_CHANNEL,
        description=(
            "Tenant B measures latency variations to infer information "
            "about Tenant A's request volume or prompt lengths. "
            "Timing side channels are difficult to fully eliminate."
        ),
        severity="medium",
        mitigation="Add latency noise + dedicated GPU partitions (MIG)",
    ),
]
⚠️ Warning

KV cache leakage is the most critical threat in multi-tenant LLM serving. Unlike CPU-based services where memory isolation is enforced by the OS, GPU memory is a shared flat address space within a process. All tenants’ KV caches exist in the same GPU memory, separated only by software-level indexing. A single off-by-one error in the cache index could expose one tenant’s prompt to another.
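The flat-address-space risk is easy to demonstrate with a toy model: a plain Python list standing in for GPU memory, with tenants separated only by software-computed offsets. The block size and layout here are invented for illustration:

```python
# Toy model: one flat "GPU memory" shared by all tenants,
# separated only by software-level block indexing
BLOCK_SIZE = 4
gpu_memory = ["tenantA-secret"] * BLOCK_SIZE + ["tenantB-data"] * BLOCK_SIZE

# Correct indexing: tenant B's blocks start at offset BLOCK_SIZE
tenant_b_base = BLOCK_SIZE

def read_block(base, i):
    """Read block i of a tenant's region -- no hardware bounds check."""
    return gpu_memory[base + i]

# A single off-by-one in the base offset silently returns tenant A's
# data to tenant B. Nothing crashes; the data just leaks.
buggy_base = tenant_b_base - 1
leaked = read_block(buggy_base, 0)
assert leaked == "tenantA-secret"
```

This is exactly why the namespaced keys and ownership checks in the next section verify the tenant on every lookup rather than trusting the index arithmetic.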

Software-Level Request Isolation

Per-Tenant KV Cache Namespacing

import secrets
import time
from collections import defaultdict

class TenantIsolatedKVCache:
    """
    KV cache manager with strict per-tenant isolation.

    Each tenant's KV cache entries are namespaced by tenant_id.
    Cross-tenant access is impossible by construction:
    cache keys include the tenant_id as a prefix.
    """

    def __init__(self, total_gpu_memory_bytes, per_tenant_quota_fraction=0.1):
        self.total_memory = total_gpu_memory_bytes
        self.per_tenant_quota = int(total_gpu_memory_bytes * per_tenant_quota_fraction)
        self.tenant_usage = defaultdict(int)  # tenant_id -> bytes used
        self.cache = {}  # (tenant_id, prefix_hash) -> KV cache entry

    def _make_key(self, tenant_id, prefix_hash):
        """
        Create a namespaced cache key.
        The key ALWAYS includes tenant_id to prevent cross-tenant access.
        """
        return f"{tenant_id}:{prefix_hash}"

    def store(self, tenant_id, prefix_hash, kv_data, size_bytes):
        """
        Store KV cache entry for a specific tenant.
        Enforces per-tenant memory quota.
        """
        # Enforce quota: evict enough of this tenant's own entries
        # to bring usage plus the new entry under the per-tenant limit
        overage = self.tenant_usage[tenant_id] + size_bytes - self.per_tenant_quota
        if overage > 0:
            self._evict_tenant_entries(tenant_id, overage)

        key = self._make_key(tenant_id, prefix_hash)
        self.cache[key] = {
            'data': kv_data,
            'size_bytes': size_bytes,
            'tenant_id': tenant_id,
            'created_at': time.time(),
        }
        self.tenant_usage[tenant_id] += size_bytes

    def lookup(self, tenant_id, prefix_hash):
        """
        Look up KV cache entry.
        ONLY returns entries belonging to the requesting tenant.
        """
        key = self._make_key(tenant_id, prefix_hash)
        entry = self.cache.get(key)

        if entry is None:
            return None

        # Defense in depth: verify tenant ownership explicitly.
        # Raise rather than assert -- asserts are stripped under `python -O`.
        if entry['tenant_id'] != tenant_id:
            raise PermissionError(
                f"SECURITY VIOLATION: cache entry tenant {entry['tenant_id']} "
                f"!= requesting tenant {tenant_id}"
            )

        return entry['data']

    def _evict_tenant_entries(self, tenant_id, needed_bytes):
        """Evict oldest entries from a specific tenant's cache."""
        tenant_entries = [
            (key, entry) for key, entry in self.cache.items()
            if entry['tenant_id'] == tenant_id
        ]
        tenant_entries.sort(key=lambda x: x[1]['created_at'])

        freed = 0
        for key, entry in tenant_entries:
            if freed >= needed_bytes:
                break
            del self.cache[key]
            self.tenant_usage[tenant_id] -= entry['size_bytes']
            freed += entry['size_bytes']

    def get_tenant_stats(self, tenant_id):
        """Get memory usage stats for a tenant."""
        return {
            'tenant_id': tenant_id,
            'memory_used_bytes': self.tenant_usage.get(tenant_id, 0),
            'memory_quota_bytes': self.per_tenant_quota,
            'utilization': self.tenant_usage.get(tenant_id, 0) / self.per_tenant_quota,
            'num_cached_entries': sum(
                1 for entry in self.cache.values()
                if entry['tenant_id'] == tenant_id
            ),
        }

Request-Level Isolation

class RequestIsolationMiddleware:
    """
    Middleware that ensures every request is tagged with a tenant_id
    and that no cross-tenant data flows occur.
    """

    def __init__(self, auth_provider):
        self.auth = auth_provider

    async def process_request(self, raw_request):
        """
        Validate and tag every request with tenant context.
        This runs before any processing.
        """
        # Extract and validate tenant identity
        api_key = raw_request.headers.get('Authorization', '').replace('Bearer ', '')
        tenant = await self.auth.validate_and_get_tenant(api_key)

        if tenant is None:
            return {'error': 'invalid_api_key'}, 401

        # Create isolated request context
        context = RequestContext(
            request_id=self._generate_request_id(),
            tenant_id=tenant.id,
            tenant_tier=tenant.tier,
            rate_limit=tenant.rate_limit,
            max_tokens_per_request=tenant.max_tokens,
            allowed_models=tenant.allowed_models,
        )

        # Tag the request
        raw_request.context = context

        # Validate model access
        requested_model = raw_request.body.get('model', '')
        if requested_model not in tenant.allowed_models:
            return {'error': f'model {requested_model} not in allowed list'}, 403

        return raw_request, 200

    def _generate_request_id(self):
        """Generate a cryptographically random request ID."""
        return f"req_{secrets.token_hex(16)}"

@dataclass
class RequestContext:
    request_id: str
    tenant_id: str
    tenant_tier: str
    rate_limit: float  # Requests per second
    max_tokens_per_request: int
    allowed_models: list

Per-Tenant Rate Limiting and Fair Scheduling

Token Bucket Rate Limiter

import threading
import time
from collections import defaultdict

class TenantRateLimiter:
    """
    Per-tenant rate limiting using token bucket algorithm.
    Prevents any single tenant from starving others.
    """

    def __init__(self):
        self.buckets = {}
        self.lock = threading.Lock()

    def configure_tenant(self, tenant_id, requests_per_second, burst_size):
        """Configure rate limit for a tenant."""
        with self.lock:
            self.buckets[tenant_id] = {
                'rate': requests_per_second,
                'burst': burst_size,
                'tokens': burst_size,  # Start full
                'last_refill': time.time(),
            }

    def allow_request(self, tenant_id):
        """Check if a request from this tenant should be allowed."""
        with self.lock:
            bucket = self.buckets.get(tenant_id)
            if bucket is None:
                return False  # Unknown tenant

            # Refill tokens
            now = time.time()
            elapsed = now - bucket['last_refill']
            bucket['tokens'] = min(
                bucket['burst'],
                bucket['tokens'] + elapsed * bucket['rate']
            )
            bucket['last_refill'] = now

            # Check and consume
            if bucket['tokens'] >= 1.0:
                bucket['tokens'] -= 1.0
                return True
            return False

    def get_remaining(self, tenant_id):
        """Get remaining rate limit tokens for a tenant."""
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            return 0
        return int(bucket['tokens'])

class FairScheduler:
    """
    Fair scheduler that ensures each tenant gets proportional GPU time.
    Uses Weighted Fair Queuing (WFQ).
    """

    def __init__(self, tenant_weights):
        """
        Args:
            tenant_weights: Dict of tenant_id -> weight.
                           Weight determines share of GPU time.
        """
        self.weights = tenant_weights
        self.virtual_time = defaultdict(float)  # Per-tenant virtual finish time
        self.queues = defaultdict(list)

    def enqueue(self, request, tenant_id):
        """Add a request to the tenant's queue with virtual timestamp."""
        weight = self.weights.get(tenant_id, 1.0)
        cost = request.get('estimated_tokens', 1000) / weight

        # Virtual finish time = max(current virtual time, last finish time) + cost
        vft = max(
            self._global_virtual_time(),
            self.virtual_time[tenant_id]
        ) + cost

        self.virtual_time[tenant_id] = vft
        self.queues[tenant_id].append((vft, request))

    def dequeue_batch(self, batch_size):
        """
        Select next batch of requests using weighted fair queuing.
        Requests with lowest virtual finish time go first.
        """
        # Merge all queues and sort by virtual finish time
        all_requests = []
        for tenant_id, queue in self.queues.items():
            for vft, request in queue:
                all_requests.append((vft, tenant_id, request))

        all_requests.sort(key=lambda x: x[0])

        batch = []
        selected_per_tenant = defaultdict(int)

        for vft, tenant_id, request in all_requests:
            if len(batch) >= batch_size:
                break
            batch.append(request)
            selected_per_tenant[tenant_id] += 1

        # Remove selected from queues
        for tenant_id, count in selected_per_tenant.items():
            self.queues[tenant_id] = self.queues[tenant_id][count:]

        return batch

    def _global_virtual_time(self):
        """Minimum virtual finish time across all tenants."""
        if not self.virtual_time:
            return 0.0
        return min(self.virtual_time.values())

Fair Scheduling Impact on Multi-Tenant Latency

| Scenario | Tenant A P99 TTFT | Tenant B P99 TTFT | Fairness Index |
|---|---|---|---|
| No isolation (FIFO) | 120 ms | 1,850 ms | 0.32 |
| Rate limiting only | 180 ms | 450 ms | 0.71 |
| Fair scheduling (equal weight) | 220 ms | 235 ms | 0.97 |
| Fair scheduling (2:1 weight) | 170 ms | 290 ms | 0.89 |
| Dedicated GPUs (MIG) | 180 ms | 185 ms | 0.99 |
Note: Scenario: Tenant A sends 10 QPS (normal), Tenant B sends 200 QPS (burst). Without fair scheduling, Tenant A's P99 TTFT degrades 15x. Fair scheduling keeps Tenant A within 2x of its normal latency. MIG provides near-perfect isolation at the cost of GPU efficiency.
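The fairness index here is, I assume, Jain's fairness index computed over per-tenant service shares: it is 1.0 when every tenant is treated identically and collapses toward 1/n as one tenant dominates. A quick reference implementation:

```python
def jains_fairness_index(shares):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    n = len(shares)
    total = sum(shares)
    return (total * total) / (n * sum(x * x for x in shares))

# Two tenants served equally -> perfectly fair
assert jains_fairness_index([120.0, 120.0]) == 1.0

# One tenant getting 10x the service of the other -> index near 1/n
assert round(jains_fairness_index([10.0, 1.0]), 2) == 0.6
```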

Hardware Isolation with NVIDIA MIG

Multi-Instance GPU

MIG (Multi-Instance GPU) partitions a single GPU into up to 7 isolated instances. Each instance has dedicated compute units, memory, and memory bandwidth. One MIG instance cannot access another’s memory, providing hardware-enforced isolation.

class MIGManager:
    """
    Manage NVIDIA Multi-Instance GPU partitions for tenant isolation.

    MIG partitions available on H100 (80GB, 132 SMs):
    - 1g.10gb: 1/7 of GPU (10GB, 16 SMs)
    - 2g.20gb: 2/7 of GPU (20GB, 33 SMs)
    - 3g.40gb: 3/7 of GPU (40GB, 49 SMs)
    - 4g.40gb: 4/7 of GPU (40GB, 66 SMs)
    - 7g.80gb: Full GPU (80GB, 132 SMs)
    """

    MIG_PROFILES = {
        'h100': {
            '1g.10gb': {'memory_gb': 10, 'sm_count': 16, 'fraction': 1/7},
            '2g.20gb': {'memory_gb': 20, 'sm_count': 33, 'fraction': 2/7},
            '3g.40gb': {'memory_gb': 40, 'sm_count': 49, 'fraction': 3/7},
            '4g.40gb': {'memory_gb': 40, 'sm_count': 66, 'fraction': 4/7},
            '7g.80gb': {'memory_gb': 80, 'sm_count': 132, 'fraction': 7/7},
        }
    }

    def __init__(self, gpu_type='h100'):
        self.gpu_type = gpu_type
        self.profiles = self.MIG_PROFILES[gpu_type]

    def plan_partitions(self, tenant_requirements):
        """
        Plan MIG partition allocation for a set of tenants.

        Args:
            tenant_requirements: Dict of tenant_id -> {
                'model_memory_gb': required GPU memory,
                'throughput_fraction': fraction of GPU needed,
            }
        """
        allocations = {}
        remaining_fractions = 1.0

        # Sort tenants by requirement (largest first for better packing)
        sorted_tenants = sorted(
            tenant_requirements.items(),
            key=lambda x: x[1]['throughput_fraction'],
            reverse=True,
        )

        for tenant_id, req in sorted_tenants:
            # Find the smallest profile that satisfies both the memory
            # and throughput requirements and still fits on this GPU
            best_profile = None
            for profile_name, spec in sorted(
                self.profiles.items(), key=lambda x: x[1]['fraction']
            ):
                if (spec['memory_gb'] >= req['model_memory_gb'] and
                        spec['fraction'] >= req['throughput_fraction'] and
                        spec['fraction'] <= remaining_fractions):
                    best_profile = profile_name
                    break

            if best_profile:
                allocations[tenant_id] = {
                    'profile': best_profile,
                    'spec': self.profiles[best_profile],
                }
                remaining_fractions -= self.profiles[best_profile]['fraction']
            else:
                allocations[tenant_id] = {
                    'profile': None,
                    'error': 'No suitable MIG partition available',
                }

        return allocations

    def apply_partitions(self, gpu_id, allocations):
        """
        Apply MIG partitions to a physical GPU.
        Returns shell commands to execute.
        """
        commands = []

        # Enable MIG mode
        commands.append(f"sudo nvidia-smi -i {gpu_id} -mig 1")

        # Create instances
        for tenant_id, alloc in allocations.items():
            if alloc['profile'] is None:
                continue

            profile = alloc['profile']
            # Map profile name to nvidia-smi profile ID
            profile_map = {
                '1g.10gb': '19', '2g.20gb': '14', '3g.40gb': '9',
                '4g.40gb': '5', '7g.80gb': '0',
            }
            profile_id = profile_map.get(profile, '19')

            commands.append(
                f"sudo nvidia-smi mig -i {gpu_id} "
                f"-cgi {profile_id} -C"
            )

        return commands
ℹ️ Note

MIG provides the strongest isolation guarantee but reduces GPU efficiency. A 3g.40gb MIG instance has 40GB of memory (half of H100's 80GB) but only 49 SMs (about 37% of H100's 132 SMs), so it gets proportionally less compute per gigabyte than the full GPU. For LLM serving, where batch size is limited by memory, MIG instances typically achieve 70-85% of their theoretical throughput. Factor this into capacity planning.
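A quick way to quantify the memory/compute imbalance, using the profile figures from the `MIGManager` sketch above (verify against `nvidia-smi mig -lgip` on your own hardware):

```python
# H100 MIG profiles as (memory_gb, sm_count); full GPU = 80 GB, 132 SMs.
# Figures repeated from the MIGManager sketch above.
PROFILES = {
    '1g.10gb': (10, 16),
    '2g.20gb': (20, 33),
    '3g.40gb': (40, 49),
    '4g.40gb': (40, 66),
    '7g.80gb': (80, 132),
}

def fractions(profile):
    """Return (memory fraction, SM fraction) relative to the full GPU."""
    mem_gb, sms = PROFILES[profile]
    return mem_gb / 80, sms / 132

# 3g.40gb gets half the memory but only ~37% of the SMs
mem_frac, sm_frac = fractions('3g.40gb')
assert mem_frac == 0.5
assert round(sm_frac, 2) == 0.37
```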

Network Isolation

Per-Tenant Network Segmentation

class NetworkIsolationConfig:
    """
    Network isolation for multi-tenant Dynamo deployments.
    """

    def generate_network_policy(self, tenant_id, namespace):
        """
        Generate Kubernetes NetworkPolicy for tenant isolation.
        Each tenant's requests flow through dedicated network paths.
        """
        return {
            'apiVersion': 'networking.k8s.io/v1',
            'kind': 'NetworkPolicy',
            'metadata': {
                'name': f'tenant-{tenant_id}-policy',
                'namespace': namespace,
            },
            'spec': {
                'podSelector': {
                    'matchLabels': {
                        'tenant': tenant_id,
                    },
                },
                'policyTypes': ['Ingress', 'Egress'],
                'ingress': [{
                    'from': [
                        # Only allow traffic from tenant's API gateway
                        {
                            'podSelector': {
                                'matchLabels': {
                                    'app': 'dynamo-gateway',
                                    'tenant': tenant_id,
                                },
                            },
                        },
                        # And from Dynamo internal components
                        {
                            'podSelector': {
                                'matchLabels': {
                                    'app': 'dynamo-router',
                                },
                            },
                        },
                    ],
                }],
                'egress': [{
                    'to': [
                        # Only allow traffic to Dynamo workers
                        {
                            'podSelector': {
                                'matchLabels': {
                                    'app': 'dynamo-worker',
                                },
                            },
                        },
                    ],
                }],
            },
        }

    def generate_mtls_config(self, tenant_id):
        """
        Generate mTLS configuration for tenant communication.
        Every request between Dynamo components is encrypted
        with tenant-specific certificates.
        """
        return {
            'tls': {
                'mode': 'STRICT',
                'certificate_chain': f'/certs/{tenant_id}/cert.pem',
                'private_key': f'/certs/{tenant_id}/key.pem',
                'root_ca': f'/certs/{tenant_id}/ca.pem',
            },
            'peer_authentication': {
                'mode': 'STRICT',
                'expected_san': f'spiffe://cluster.local/ns/llm-serving/sa/tenant-{tenant_id}',
            },
        }

Audit Logging

Per-Request Audit Trail

import json
import hashlib
from datetime import datetime

class AuditLogger:
    """
    Comprehensive audit logging for compliance.
    Every request is logged with tenant attribution,
    timing, and content hashes (not content itself).
    """

    def __init__(self, log_backend, encryption_key):
        self.backend = log_backend
        self.encryption_key = encryption_key

    def log_request(self, request, response, metrics):
        """
        Log a complete request-response cycle.

        Logs content hashes, not actual content, to avoid
        storing sensitive data in audit logs.
        """
        audit_entry = {
            'timestamp': datetime.utcnow().isoformat(),
            'request_id': request['request_id'],
            'tenant_id': request['tenant_id'],
            'model': request['model'],

            # Content hashes (not actual content)
            'prompt_hash': self._hash_content(
                json.dumps(request.get('messages', []))
            ),
            'response_hash': self._hash_content(
                response.get('text', '')
            ),

            # Metrics
            'input_tokens': metrics.get('input_tokens', 0),
            'output_tokens': metrics.get('output_tokens', 0),
            'ttft_ms': metrics.get('ttft_ms', 0),
            'total_latency_ms': metrics.get('total_latency_ms', 0),

            # Routing
            'worker_id': metrics.get('worker_id', ''),
            'routing_strategy': metrics.get('routing_strategy', ''),
            'kv_cache_hit': metrics.get('kv_cache_hit', False),

            # Status
            'status': 'success' if 'error' not in response else 'error',
            'error_code': response.get('error', {}).get('code', ''),

            # Security
            'source_ip': request.get('source_ip', ''),
            'api_key_prefix': request.get('api_key', '')[:8] + '...',
        }

        # Encrypt and store
        encrypted = self._encrypt(json.dumps(audit_entry))
        self.backend.write(encrypted)

        return audit_entry['request_id']

    def _hash_content(self, content):
        """Hash content for audit trail without storing actual content."""
        return hashlib.sha256(content.encode()).hexdigest()[:16]

    def _encrypt(self, data):
        """Encrypt audit log entry at rest."""
        # Use AES-256-GCM in production
        # Simplified placeholder
        return data

    def query_tenant_logs(self, tenant_id, start_time, end_time):
        """
        Query audit logs for a specific tenant.
        Tenants can only see their own logs.
        """
        return self.backend.query(
            tenant_id=tenant_id,
            start_time=start_time,
            end_time=end_time,
        )

class ComplianceReporter:
    """
    Generate compliance reports from audit logs.
    """

    def __init__(self, audit_logger):
        self.logger = audit_logger

    def generate_soc2_report(self, tenant_id, period_start, period_end):
        """Generate SOC2-relevant metrics for a tenant."""
        logs = self.logger.query_tenant_logs(
            tenant_id, period_start, period_end
        )

        total_requests = len(logs)
        successful = sum(1 for l in logs if l['status'] == 'success')
        errors = total_requests - successful

        return {
            'tenant_id': tenant_id,
            'period': f"{period_start} to {period_end}",
            'total_requests': total_requests,
            'success_rate': successful / max(total_requests, 1),
            'error_rate': errors / max(total_requests, 1),
            'avg_latency_ms': sum(l['total_latency_ms'] for l in logs) / max(total_requests, 1),
            'total_input_tokens': sum(l['input_tokens'] for l in logs),
            'total_output_tokens': sum(l['output_tokens'] for l in logs),
            'unique_api_keys_used': len(set(l['api_key_prefix'] for l in logs)),
            'data_retention': 'Content not stored; hashes only',
            'encryption': 'AES-256-GCM at rest, TLS 1.3 in transit',
        }

Isolation Levels Summary


Isolation Levels for Multi-Tenant Dynamo

| Level | Mechanism | Isolation Strength | Performance Overhead | Cost Overhead |
|---|---|---|---|---|
| 1: Logical | Per-tenant KV namespacing + rate limits | Medium | <1% | 0% |
| 2: Scheduling | Fair queuing + resource quotas | Medium-High | 2-5% | 0% |
| 3: Process | Separate worker processes per tenant | High | 10-20% | 20-30% |
| 4: MIG | Hardware GPU partitioning | Very High | 15-30% | 30-50% |
| 5: Dedicated | Separate GPU nodes per tenant | Maximum | ~0% | 100-400% |
Note: Most production deployments use Level 1-2 (logical + scheduling). Regulated industries (healthcare, finance) may require Level 3-4. Level 5 (dedicated) is rarely justified unless contractually required.

Cost vs Isolation Tradeoff

| Isolation Level | Relative Cost |
|---|---|
| Shared (Level 1-2) | 1.0x |
| Process (Level 3) | 1.25x |
| MIG (Level 4) | 1.4x |
| Dedicated (Level 5) | 3.0x |

Multi-tenant security in Dynamo is a spectrum: from software-level namespacing (cheap, good enough for most use cases) to hardware-level MIG partitioning (expensive, required for regulated industries) to fully dedicated nodes (most expensive, contractually required for some enterprise customers). The right choice depends on the threat model, compliance requirements, and cost tolerance. For most SaaS deployments, Level 1-2 (logical isolation + fair scheduling) provides sufficient security. The critical implementation detail is never sharing KV cache entries across tenants, even when the prefixes are identical — the cost of re-computing a shared system prompt is negligible compared to the risk of cross-tenant information leakage.