Running multiple tenants on the same GPU cluster saves 40-60% on infrastructure costs compared to dedicated per-tenant deployments. A Llama 70B model loaded once on 4xH100 can serve requests from 50 different customers simultaneously. But shared infrastructure creates security risks: one tenant’s prompt might leak into another tenant’s KV cache. One tenant’s traffic spike might degrade another tenant’s latency. One tenant’s request logs might be visible to another tenant’s administrator.
Dynamo addresses multi-tenancy through four layers of isolation: software-level request isolation (separate KV caches per tenant), hardware-level GPU partitioning (NVIDIA MIG), network-level segmentation (per-tenant VLANs and mTLS), and audit-level compliance (per-request logging with tenant attribution). This post covers the implementation of each layer.
## The Multi-Tenancy Threat Model

### What Can Go Wrong
```python
from dataclasses import dataclass
from enum import Enum


class ThreatCategory(Enum):
    KV_CACHE_LEAKAGE = "kv_cache_leakage"
    PROMPT_INJECTION = "prompt_injection"
    RESOURCE_STARVATION = "resource_starvation"
    DATA_EXFILTRATION = "data_exfiltration"
    SIDE_CHANNEL = "side_channel"


@dataclass
class ThreatScenario:
    category: ThreatCategory
    description: str
    severity: str  # "critical", "high", "medium", "low"
    mitigation: str


THREAT_MODEL = [
    ThreatScenario(
        category=ThreatCategory.KV_CACHE_LEAKAGE,
        description=(
            "Tenant A's KV cache entries are read by Tenant B's decode step. "
            "This could expose Tenant A's prompt content to Tenant B. "
            "Most likely cause: bug in KV cache indexing that maps to wrong tenant."
        ),
        severity="critical",
        mitigation="Per-tenant KV cache namespacing + memory access guards",
    ),
    ThreatScenario(
        category=ThreatCategory.PROMPT_INJECTION,
        description=(
            "A malicious prompt causes the model to reveal information "
            "from other tenants' system prompts stored in KV cache. "
            "Especially dangerous with prefix caching where system prompts "
            "might be shared across tenants."
        ),
        severity="critical",
        mitigation="Never share KV cache across tenants, even for identical prefixes",
    ),
    ThreatScenario(
        category=ThreatCategory.RESOURCE_STARVATION,
        description=(
            "Tenant A sends a burst of long-context requests that consume "
            "all GPU memory and batch slots, causing Tenant B's requests "
            "to timeout or be rejected."
        ),
        severity="high",
        mitigation="Per-tenant rate limits + resource quotas + fair scheduling",
    ),
    ThreatScenario(
        category=ThreatCategory.DATA_EXFILTRATION,
        description=(
            "Model outputs from Tenant A's requests are logged in a shared "
            "logging system accessible to Tenant B's administrators."
        ),
        severity="high",
        mitigation="Per-tenant log partitioning + encryption at rest",
    ),
    ThreatScenario(
        category=ThreatCategory.SIDE_CHANNEL,
        description=(
            "Tenant B measures latency variations to infer information "
            "about Tenant A's request volume or prompt lengths. "
            "Timing side channels are difficult to fully eliminate."
        ),
        severity="medium",
        mitigation="Add latency noise + dedicated GPU partitions (MIG)",
    ),
]
```
KV cache leakage is the most critical threat in multi-tenant LLM serving. Unlike CPU-based services where memory isolation is enforced by the OS, GPU memory is a shared flat address space within a process. All tenants’ KV caches exist in the same GPU memory, separated only by software-level indexing. A single off-by-one error in the cache index could expose one tenant’s prompt to another.
## Software-Level Request Isolation

### Per-Tenant KV Cache Namespacing
```python
import hashlib
import secrets
import time
from collections import defaultdict


class TenantIsolatedKVCache:
    """
    KV cache manager with strict per-tenant isolation.

    Each tenant's KV cache entries are namespaced by tenant_id.
    Cross-tenant access is impossible by construction:
    cache keys include the tenant_id as a prefix.
    """

    def __init__(self, total_gpu_memory_bytes, per_tenant_quota_fraction=0.1):
        self.total_memory = total_gpu_memory_bytes
        self.per_tenant_quota = int(total_gpu_memory_bytes * per_tenant_quota_fraction)
        self.tenant_usage = defaultdict(int)  # tenant_id -> bytes used
        self.cache = {}  # "tenant_id:prefix_hash" -> KV cache entry

    def _make_key(self, tenant_id, prefix_hash):
        """
        Create a namespaced cache key.

        The key ALWAYS includes tenant_id to prevent cross-tenant access.
        """
        return f"{tenant_id}:{prefix_hash}"

    def store(self, tenant_id, prefix_hash, kv_data, size_bytes):
        """
        Store a KV cache entry for a specific tenant.

        Enforces the per-tenant memory quota.
        """
        # Check quota
        if self.tenant_usage[tenant_id] + size_bytes > self.per_tenant_quota:
            # Evict oldest entries for this tenant
            self._evict_tenant_entries(tenant_id, size_bytes)

        key = self._make_key(tenant_id, prefix_hash)
        self.cache[key] = {
            'data': kv_data,
            'size_bytes': size_bytes,
            'tenant_id': tenant_id,
            'created_at': time.time(),
        }
        self.tenant_usage[tenant_id] += size_bytes

    def lookup(self, tenant_id, prefix_hash):
        """
        Look up a KV cache entry.

        ONLY returns entries belonging to the requesting tenant.
        """
        key = self._make_key(tenant_id, prefix_hash)
        entry = self.cache.get(key)
        if entry is None:
            return None

        # Defense in depth: verify tenant_id matches
        assert entry['tenant_id'] == tenant_id, \
            f"SECURITY VIOLATION: cache entry tenant {entry['tenant_id']} " \
            f"!= requesting tenant {tenant_id}"
        return entry['data']

    def _evict_tenant_entries(self, tenant_id, needed_bytes):
        """Evict the oldest entries from a specific tenant's cache."""
        tenant_entries = [
            (key, entry) for key, entry in self.cache.items()
            if entry['tenant_id'] == tenant_id
        ]
        tenant_entries.sort(key=lambda x: x[1]['created_at'])

        freed = 0
        for key, entry in tenant_entries:
            if freed >= needed_bytes:
                break
            del self.cache[key]
            self.tenant_usage[tenant_id] -= entry['size_bytes']
            freed += entry['size_bytes']

    def get_tenant_stats(self, tenant_id):
        """Get memory usage stats for a tenant."""
        return {
            'tenant_id': tenant_id,
            'memory_used_bytes': self.tenant_usage.get(tenant_id, 0),
            'memory_quota_bytes': self.per_tenant_quota,
            'utilization': self.tenant_usage.get(tenant_id, 0) / self.per_tenant_quota,
            'num_cached_entries': sum(
                1 for entry in self.cache.values()
                if entry['tenant_id'] == tenant_id
            ),
        }
```
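To make the invariant concrete, here is a minimal, self-contained sketch (not Dynamo code) of why embedding the tenant_id in the cache key blocks cross-tenant reads by construction:

```python
# Minimal illustration of the namespacing invariant: because the key embeds
# tenant_id, Tenant B can never retrieve Tenant A's entry, even when both
# tenants have the exact same prefix hash.
cache = {}

def store(tenant_id, prefix_hash, kv_data):
    cache[f"{tenant_id}:{prefix_hash}"] = kv_data

def lookup(tenant_id, prefix_hash):
    return cache.get(f"{tenant_id}:{prefix_hash}")

store("tenant-a", "abc123", "tenant A's KV blocks")
assert lookup("tenant-a", "abc123") == "tenant A's KV blocks"
# Same prefix hash, different tenant: a miss, not a leak.
assert lookup("tenant-b", "abc123") is None
```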
### Request-Level Isolation
```python
import secrets
from dataclasses import dataclass


class RequestIsolationMiddleware:
    """
    Middleware that ensures every request is tagged with a tenant_id
    and that no cross-tenant data flows occur.
    """

    def __init__(self, auth_provider):
        self.auth = auth_provider

    async def process_request(self, raw_request):
        """
        Validate and tag every request with tenant context.

        This runs before any processing.
        """
        # Extract and validate tenant identity
        api_key = raw_request.headers.get('Authorization', '').replace('Bearer ', '')
        tenant = await self.auth.validate_and_get_tenant(api_key)
        if tenant is None:
            return {'error': 'invalid_api_key'}, 401

        # Create isolated request context
        context = RequestContext(
            request_id=self._generate_request_id(),
            tenant_id=tenant.id,
            tenant_tier=tenant.tier,
            rate_limit=tenant.rate_limit,
            max_tokens_per_request=tenant.max_tokens,
            allowed_models=tenant.allowed_models,
        )

        # Tag the request
        raw_request.context = context

        # Validate model access
        requested_model = raw_request.body.get('model', '')
        if requested_model not in tenant.allowed_models:
            return {'error': f'model {requested_model} not in allowed list'}, 403

        return raw_request, 200

    def _generate_request_id(self):
        """Generate a cryptographically random request ID."""
        return f"req_{secrets.token_hex(16)}"


@dataclass
class RequestContext:
    request_id: str
    tenant_id: str
    tenant_tier: str
    rate_limit: float  # Requests per second
    max_tokens_per_request: int
    allowed_models: list
```
## Per-Tenant Rate Limiting and Fair Scheduling

### Token Bucket Rate Limiter
```python
import threading
import time


class TenantRateLimiter:
    """
    Per-tenant rate limiting using the token bucket algorithm.

    Prevents any single tenant from starving others.
    """

    def __init__(self):
        self.buckets = {}
        self.lock = threading.Lock()

    def configure_tenant(self, tenant_id, requests_per_second, burst_size):
        """Configure the rate limit for a tenant."""
        with self.lock:
            self.buckets[tenant_id] = {
                'rate': requests_per_second,
                'burst': burst_size,
                'tokens': burst_size,  # Start full
                'last_refill': time.time(),
            }

    def allow_request(self, tenant_id):
        """Check if a request from this tenant should be allowed."""
        with self.lock:
            bucket = self.buckets.get(tenant_id)
            if bucket is None:
                return False  # Unknown tenant

            # Refill tokens
            now = time.time()
            elapsed = now - bucket['last_refill']
            bucket['tokens'] = min(
                bucket['burst'],
                bucket['tokens'] + elapsed * bucket['rate']
            )
            bucket['last_refill'] = now

            # Check and consume
            if bucket['tokens'] >= 1.0:
                bucket['tokens'] -= 1.0
                return True
            return False

    def get_remaining(self, tenant_id):
        """Get the remaining rate-limit tokens for a tenant."""
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            return 0
        return int(bucket['tokens'])
```
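The refill arithmetic is easier to verify with an explicit clock than with `time.time()`. This condensed sketch of the same token-bucket logic takes the timestamp as a parameter (an illustration, not part of the class above):

```python
# Same token-bucket refill logic as allow_request, with an injected timestamp
# so the math can be checked deterministically, without sleeping.
def make_bucket(rate, burst):
    return {'rate': rate, 'burst': burst, 'tokens': float(burst), 'last': 0.0}

def allow(bucket, now):
    elapsed = now - bucket['last']
    bucket['tokens'] = min(bucket['burst'],
                           bucket['tokens'] + elapsed * bucket['rate'])
    bucket['last'] = now
    if bucket['tokens'] >= 1.0:
        bucket['tokens'] -= 1.0
        return True
    return False

b = make_bucket(rate=2.0, burst=3)             # 2 req/s, bursts of 3
assert all(allow(b, 0.0) for _ in range(3))    # the burst drains the bucket
assert not allow(b, 0.0)                       # 4th immediate request rejected
assert allow(b, 0.5)                           # 0.5 s later, one token refilled
```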
```python
from collections import defaultdict


class FairScheduler:
    """
    Fair scheduler that gives each tenant a proportional share of GPU time.

    Uses Weighted Fair Queuing (WFQ).
    """

    def __init__(self, tenant_weights):
        """
        Args:
            tenant_weights: Dict of tenant_id -> weight.
                Weight determines the tenant's share of GPU time.
        """
        self.weights = tenant_weights
        self.virtual_time = defaultdict(float)  # Per-tenant virtual finish time
        self.queues = defaultdict(list)

    def enqueue(self, request, tenant_id):
        """Add a request to the tenant's queue with a virtual timestamp."""
        weight = self.weights.get(tenant_id, 1.0)
        cost = request.get('estimated_tokens', 1000) / weight

        # Virtual finish time = max(global virtual time, last finish time) + cost
        vft = max(
            self._global_virtual_time(),
            self.virtual_time[tenant_id]
        ) + cost
        self.virtual_time[tenant_id] = vft
        self.queues[tenant_id].append((vft, request))

    def dequeue_batch(self, batch_size):
        """
        Select the next batch of requests using weighted fair queuing.

        Requests with the lowest virtual finish time go first.
        """
        # Merge all queues and sort by virtual finish time
        all_requests = []
        for tenant_id, queue in self.queues.items():
            for vft, request in queue:
                all_requests.append((vft, tenant_id, request))
        all_requests.sort(key=lambda x: x[0])

        batch = []
        selected_per_tenant = defaultdict(int)
        for vft, tenant_id, request in all_requests:
            if len(batch) >= batch_size:
                break
            batch.append(request)
            selected_per_tenant[tenant_id] += 1

        # Remove the selected requests from the front of each tenant's queue
        for tenant_id, count in selected_per_tenant.items():
            self.queues[tenant_id] = self.queues[tenant_id][count:]

        return batch

    def _global_virtual_time(self):
        """Minimum virtual finish time across all tenants."""
        if not self.virtual_time:
            return 0.0
        return min(self.virtual_time.values())
```
### Fair Scheduling Impact on Multi-Tenant Latency
| Scenario | Tenant A P99 TTFT | Tenant B P99 TTFT | Fairness Index |
|---|---|---|---|
| No isolation (FIFO) | 120ms | 1,850ms | 0.32 |
| Rate limiting only | 180ms | 450ms | 0.71 |
| Fair scheduling (equal weight) | 220ms | 235ms | 0.97 |
| Fair scheduling (2:1 weight) | 170ms | 290ms | 0.89 |
| Dedicated GPUs (MIG) | 180ms | 185ms | 0.99 |
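The table's fairness index is not defined in the post; one common choice, and the assumption here, is Jain's fairness index over per-tenant service (e.g. throughput), which equals 1.0 when every tenant is served equally:

```python
# Jain's fairness index: J = (sum x_i)^2 / (n * sum x_i^2).
# J = 1.0 means perfectly equal service; J -> 1/n as one tenant dominates.
def jain_index(service_levels):
    n = len(service_levels)
    total = sum(service_levels)
    return total * total / (n * sum(x * x for x in service_levels))

assert jain_index([100, 100, 100]) == 1.0   # equal service for every tenant
assert jain_index([100, 10]) < 0.7          # one tenant heavily favored
```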
## Hardware Isolation with NVIDIA MIG

### Multi-Instance GPU
MIG (Multi-Instance GPU) partitions a single GPU into up to 7 isolated instances. Each instance has dedicated compute units, memory, and memory bandwidth. One MIG instance cannot access another’s memory, providing hardware-enforced isolation.
```python
class MIGManager:
    """
    Manage NVIDIA Multi-Instance GPU partitions for tenant isolation.

    MIG partitions available on H100:
    - 1g.10gb: 1/7 of GPU (10GB, 16.5 SMs)
    - 2g.20gb: 2/7 of GPU (20GB, 33 SMs)
    - 3g.40gb: 3/7 of GPU (40GB, 49.5 SMs)
    - 4g.40gb: 4/7 of GPU (40GB, 66 SMs)
    - 7g.80gb: Full GPU (80GB, all SMs)
    """

    MIG_PROFILES = {
        'h100': {
            '1g.10gb': {'memory_gb': 10, 'sm_count': 16, 'fraction': 1 / 7},
            '2g.20gb': {'memory_gb': 20, 'sm_count': 33, 'fraction': 2 / 7},
            '3g.40gb': {'memory_gb': 40, 'sm_count': 49, 'fraction': 3 / 7},
            '4g.40gb': {'memory_gb': 40, 'sm_count': 66, 'fraction': 4 / 7},
            '7g.80gb': {'memory_gb': 80, 'sm_count': 132, 'fraction': 7 / 7},
        }
    }

    def __init__(self, gpu_type='h100'):
        self.gpu_type = gpu_type
        self.profiles = self.MIG_PROFILES[gpu_type]

    def plan_partitions(self, tenant_requirements):
        """
        Plan MIG partition allocation for a set of tenants.

        Args:
            tenant_requirements: Dict of tenant_id -> {
                'model_memory_gb': required GPU memory,
                'throughput_fraction': fraction of GPU needed,
            }
        """
        allocations = {}
        remaining_fraction = 1.0

        # Sort tenants by requirement (largest first for better packing)
        sorted_tenants = sorted(
            tenant_requirements.items(),
            key=lambda x: x[1]['throughput_fraction'],
            reverse=True,
        )

        for tenant_id, req in sorted_tenants:
            # Find the smallest profile that fits
            best_profile = None
            for profile_name, spec in sorted(
                self.profiles.items(), key=lambda x: x[1]['fraction']
            ):
                if (spec['memory_gb'] >= req['model_memory_gb'] and
                        spec['fraction'] <= remaining_fraction):
                    best_profile = profile_name
                    break

            if best_profile:
                allocations[tenant_id] = {
                    'profile': best_profile,
                    'spec': self.profiles[best_profile],
                }
                remaining_fraction -= self.profiles[best_profile]['fraction']
            else:
                allocations[tenant_id] = {
                    'profile': None,
                    'error': 'No suitable MIG partition available',
                }

        return allocations

    def apply_partitions(self, gpu_id, allocations):
        """
        Apply MIG partitions to a physical GPU.

        Returns shell commands to execute.
        """
        commands = []

        # Enable MIG mode
        commands.append(f"sudo nvidia-smi -i {gpu_id} -mig 1")

        # Map profile names to nvidia-smi GPU instance profile IDs
        profile_map = {
            '1g.10gb': '19', '2g.20gb': '14', '3g.40gb': '9',
            '4g.40gb': '5', '7g.80gb': '0',
        }

        # Create instances
        for tenant_id, alloc in allocations.items():
            if alloc['profile'] is None:
                continue
            profile_id = profile_map.get(alloc['profile'], '19')
            commands.append(
                f"sudo nvidia-smi mig -i {gpu_id} "
                f"-cgi {profile_id} -C"
            )

        return commands
```
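The greedy packing in `plan_partitions` (largest tenant first, smallest profile that satisfies its memory need) can be condensed to a self-contained sketch for experimentation; the tenant names and requirements below are illustrative:

```python
# Condensed, standalone version of the plan_partitions greedy packing.
# Profile specs mirror a subset of the H100 table above.
PROFILES = {
    '1g.10gb': {'memory_gb': 10, 'fraction': 1 / 7},
    '2g.20gb': {'memory_gb': 20, 'fraction': 2 / 7},
    '3g.40gb': {'memory_gb': 40, 'fraction': 3 / 7},
}

def plan(requirements):
    allocations, remaining = {}, 1.0
    # Largest tenant first for better packing
    ordered = sorted(requirements.items(),
                     key=lambda kv: kv[1]['throughput_fraction'],
                     reverse=True)
    for tenant, req in ordered:
        choice = None
        # Smallest profile that satisfies the memory need and still fits
        for name, spec in sorted(PROFILES.items(),
                                 key=lambda kv: kv[1]['fraction']):
            if (spec['memory_gb'] >= req['model_memory_gb']
                    and spec['fraction'] <= remaining):
                choice = name
                break
        allocations[tenant] = choice
        if choice is not None:
            remaining -= PROFILES[choice]['fraction']
    return allocations

allocation = plan({
    'tenant-a': {'model_memory_gb': 35, 'throughput_fraction': 0.4},
    'tenant-b': {'model_memory_gb': 8, 'throughput_fraction': 0.1},
})
# tenant-a needs >= 35 GB, so it gets 3g.40gb; tenant-b fits in 1g.10gb
```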
MIG provides the strongest isolation guarantee but reduces GPU efficiency. A 3g.40gb MIG instance has 40GB memory (half of H100’s 80GB) and 49.5 SMs (37.5% of H100’s 132 SMs). The memory/compute ratio is worse than the full GPU. For LLM serving, where batch size is limited by memory, MIG instances achieve 70-85% of their theoretical throughput. Factor this into capacity planning.
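A back-of-envelope capacity check based on the 70-85% efficiency range above; the full-GPU throughput figure is an illustrative assumption, not a benchmark:

```python
# Rough throughput estimate for a MIG slice, discounted by MIG efficiency.
# The 10,000 tok/s full-GPU figure is a made-up number for illustration.
def effective_tokens_per_s(profile_fraction, full_gpu_tokens_per_s,
                           mig_efficiency=0.75):
    """Estimated tokens/s a MIG slice delivers relative to the full GPU."""
    return profile_fraction * full_gpu_tokens_per_s * mig_efficiency

# A 3g.40gb slice (3/7 of an H100) serving a model that does 10,000 tok/s
# on the full GPU, at the midpoint 75% efficiency:
estimate = effective_tokens_per_s(3 / 7, 10_000)
```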
## Network Isolation

### Per-Tenant Network Segmentation
```python
class NetworkIsolationConfig:
    """Network isolation for multi-tenant Dynamo deployments."""

    def generate_network_policy(self, tenant_id, namespace):
        """
        Generate a Kubernetes NetworkPolicy for tenant isolation.

        Each tenant's requests flow through dedicated network paths.
        """
        return {
            'apiVersion': 'networking.k8s.io/v1',
            'kind': 'NetworkPolicy',
            'metadata': {
                'name': f'tenant-{tenant_id}-policy',
                'namespace': namespace,
            },
            'spec': {
                'podSelector': {
                    'matchLabels': {
                        'tenant': tenant_id,
                    },
                },
                'policyTypes': ['Ingress', 'Egress'],
                'ingress': [{
                    'from': [
                        # Only allow traffic from the tenant's API gateway
                        {
                            'podSelector': {
                                'matchLabels': {
                                    'app': 'dynamo-gateway',
                                    'tenant': tenant_id,
                                },
                            },
                        },
                        # And from Dynamo internal components
                        {
                            'podSelector': {
                                'matchLabels': {
                                    'app': 'dynamo-router',
                                },
                            },
                        },
                    ],
                }],
                'egress': [{
                    'to': [
                        # Only allow traffic to Dynamo workers
                        {
                            'podSelector': {
                                'matchLabels': {
                                    'app': 'dynamo-worker',
                                },
                            },
                        },
                    ],
                }],
            },
        }

    def generate_mtls_config(self, tenant_id):
        """
        Generate the mTLS configuration for tenant communication.

        Every request between Dynamo components is encrypted
        with tenant-specific certificates.
        """
        return {
            'tls': {
                'mode': 'STRICT',
                'certificate_chain': f'/certs/{tenant_id}/cert.pem',
                'private_key': f'/certs/{tenant_id}/key.pem',
                'root_ca': f'/certs/{tenant_id}/ca.pem',
            },
            'peer_authentication': {
                'mode': 'STRICT',
                'expected_san': f'spiffe://cluster.local/ns/llm-serving/sa/tenant-{tenant_id}',
            },
        }
```
## Audit Logging

### Per-Request Audit Trail
```python
import hashlib
import json
from datetime import datetime, timezone


class AuditLogger:
    """
    Comprehensive audit logging for compliance.

    Every request is logged with tenant attribution,
    timing, and content hashes (not content itself).
    """

    def __init__(self, log_backend, encryption_key):
        self.backend = log_backend
        self.encryption_key = encryption_key

    def log_request(self, request, response, metrics):
        """
        Log a complete request-response cycle.

        Logs content hashes, not actual content, to avoid
        storing sensitive data in audit logs.
        """
        audit_entry = {
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'request_id': request['request_id'],
            'tenant_id': request['tenant_id'],
            'model': request['model'],
            # Content hashes (not actual content)
            'prompt_hash': self._hash_content(
                json.dumps(request.get('messages', []))
            ),
            'response_hash': self._hash_content(
                response.get('text', '')
            ),
            # Metrics
            'input_tokens': metrics.get('input_tokens', 0),
            'output_tokens': metrics.get('output_tokens', 0),
            'ttft_ms': metrics.get('ttft_ms', 0),
            'total_latency_ms': metrics.get('total_latency_ms', 0),
            # Routing
            'worker_id': metrics.get('worker_id', ''),
            'routing_strategy': metrics.get('routing_strategy', ''),
            'kv_cache_hit': metrics.get('kv_cache_hit', False),
            # Status
            'status': 'success' if 'error' not in response else 'error',
            'error_code': response.get('error', {}).get('code', ''),
            # Security
            'source_ip': request.get('source_ip', ''),
            'api_key_prefix': request.get('api_key', '')[:8] + '...',
        }

        # Encrypt and store
        encrypted = self._encrypt(json.dumps(audit_entry))
        self.backend.write(encrypted)
        return audit_entry['request_id']

    def _hash_content(self, content):
        """Hash content for the audit trail without storing actual content."""
        return hashlib.sha256(content.encode()).hexdigest()[:16]

    def _encrypt(self, data):
        """Encrypt an audit log entry at rest."""
        # Use AES-256-GCM in production; simplified placeholder here.
        return data

    def query_tenant_logs(self, tenant_id, start_time, end_time):
        """
        Query audit logs for a specific tenant.

        Tenants can only see their own logs.
        """
        return self.backend.query(
            tenant_id=tenant_id,
            start_time=start_time,
            end_time=end_time,
        )


class ComplianceReporter:
    """Generate compliance reports from audit logs."""

    def __init__(self, audit_logger):
        self.logger = audit_logger

    def generate_soc2_report(self, tenant_id, period_start, period_end):
        """Generate SOC2-relevant metrics for a tenant."""
        logs = self.logger.query_tenant_logs(
            tenant_id, period_start, period_end
        )

        total_requests = len(logs)
        successful = sum(1 for entry in logs if entry['status'] == 'success')
        errors = total_requests - successful

        return {
            'tenant_id': tenant_id,
            'period': f"{period_start} to {period_end}",
            'total_requests': total_requests,
            'success_rate': successful / max(total_requests, 1),
            'error_rate': errors / max(total_requests, 1),
            'avg_latency_ms': sum(e['total_latency_ms'] for e in logs) / max(total_requests, 1),
            'total_input_tokens': sum(e['input_tokens'] for e in logs),
            'total_output_tokens': sum(e['output_tokens'] for e in logs),
            'unique_api_keys_used': len(set(e['api_key_prefix'] for e in logs)),
            'data_retention': 'Content not stored; hashes only',
            'encryption': 'AES-256-GCM at rest, TLS 1.3 in transit',
        }
```
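The hash-only logging approach can be exercised in isolation. This sketch mirrors `_hash_content` above; the message payload is illustrative:

```python
import hashlib
import json

def content_fingerprint(content: str) -> str:
    """Truncated SHA-256 hex digest, mirroring _hash_content above."""
    return hashlib.sha256(content.encode()).hexdigest()[:16]

# The audit trail stores only this 16-hex-char fingerprint, never the prompt.
messages = [{"role": "user", "content": "summarize my Q3 revenue"}]
fingerprint = content_fingerprint(json.dumps(messages))

assert len(fingerprint) == 16
# Deterministic: identical prompts map to the same fingerprint, so repeated
# requests can be correlated in the audit trail without storing content.
assert fingerprint == content_fingerprint(json.dumps(messages))
```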
## Isolation Levels Summary

### Isolation Levels for Multi-Tenant Dynamo
| Level | Mechanism | Isolation Strength | Performance Overhead | Cost Overhead |
|---|---|---|---|---|
| 1: Logical | Per-tenant KV namespacing + rate limits | Medium | <1% | 0% |
| 2: Scheduling | Fair queuing + resource quotas | Medium-High | 2-5% | 0% |
| 3: Process | Separate worker processes per tenant | High | 10-20% | 20-30% |
| 4: MIG | Hardware GPU partitioning | Very High | 15-30% | 30-50% |
| 5: Dedicated | Separate GPU nodes per tenant | Maximum | 0% | 100-400% |
### Cost vs Isolation Tradeoff
| Metric | Shared (Level 1-2) | Process (Level 3) | MIG (Level 4) | Dedicated (Level 5) |
|---|---|---|---|---|
| Relative Cost | 1.0× (baseline) | 1.2-1.3× | 1.3-1.5× | 2-5× |
Multi-tenant security in Dynamo is a spectrum: from software-level namespacing (cheap, good enough for most use cases) to hardware-level MIG partitioning (expensive, required for regulated industries) to fully dedicated nodes (most expensive, contractually required for some enterprise customers). The right choice depends on the threat model, compliance requirements, and cost tolerance. For most SaaS deployments, Level 1-2 (logical isolation + fair scheduling) provides sufficient security. The critical implementation detail is never sharing KV cache entries across tenants, even when the prefixes are identical — the cost of re-computing a shared system prompt is negligible compared to the risk of cross-tenant information leakage.