Real-world serving doesn’t mean loading one model and calling it a day. Your cluster needs to serve Llama 8B for latency-sensitive chat, Llama 70B for complex reasoning, Llama 405B for research tasks, Mistral 7B for code, plus 50 domain-specific fine-tunes and 200 per-customer LoRA adapters. A single 70B model consumes 140 GB just for weights — two full H100s before you even allocate KV cache. Do the math: 64 H100s × 80 GB = 5120 GB total, but your catalog of models needs 4740 GB of weight memory alone. The entire catalog cannot fit simultaneously. You need strategies for serving more models than your GPUs can hold: temporal sharing (swap models over time), spatial sharing (run multiple models on one GPU), or adapter pools (one base model with swappable LoRA adapters). Each strategy has different latency, memory, and operational tradeoffs that matter when users expect 100ms response times.
The problem: how do you serve more models than your GPUs can simultaneously hold? The three strategies — temporal sharing, spatial sharing, and adapter pools — trade off latency, throughput, and memory differently, and this post covers all three in implementation detail.
The Multi-Model Inventory Problem
A typical production cluster serves:
Base models:
- Llama 3 8B (16 GB FP16, 8 GB FP8)
- Llama 3 70B (140 GB FP16, 70 GB FP8)
- Llama 3 405B (810 GB FP16, 405 GB FP8)
- Mistral 7B (14 GB FP16)
- Mixtral 8x22B (264 GB FP16, MoE)
Fine-tuned variants per base:
- Chat / Instruct
- Code-specific
- Domain fine-tunes (legal, medical, finance)
- Safety-tuned variants
Total: 5-20 variants per base model
LoRA adapters per base:
- Per-customer adapters (rank 16-64)
- Task-specific adapters (summarization, extraction, translation)
- A/B testing variants
Total: 50-500 adapters per base model
Memory inventory for a 64-GPU H100 cluster: 64 × 80 GB = 5120 GB of HBM, against roughly 4740 GB of weight memory for the catalog above. This does not even account for KV cache (typically 40-60% of GPU memory during serving). The entire model catalog cannot fit simultaneously.
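A quick feasibility check makes the shortfall concrete. A sketch — the ~4740 GB catalog figure and the KV-cache reservation are the illustrative assumptions from above:

```python
def catalog_fits(catalog_weight_gb, num_gpus=64, gpu_gb=80, kv_fraction=0.5):
    """Can the catalog's weights stay resident while reserving kv_fraction for KV cache?"""
    usable_gb = num_gpus * gpu_gb * (1 - kv_fraction)
    return catalog_weight_gb <= usable_gb

print(catalog_fits(4740))                   # False: 4740 GB > 2560 GB usable
print(catalog_fits(4740, kv_fraction=0.0))  # True only if KV cache gets nothing
```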
Model Weight Memory Requirements
| Model | FP16 Size | FP8 Size | INT4 Size | GPUs Needed (FP16, 80GB) |
|---|---|---|---|---|
| Llama 3 8B | 16 GB | 8 GB | 4 GB | 1 |
| Llama 3 70B | 140 GB | 70 GB | 35 GB | 2 |
| Llama 3 405B | 810 GB | 405 GB | 203 GB | 16 (two nodes; FP8 fits at TP=8) |
| Mistral 7B | 14 GB | 7 GB | 3.5 GB | 1 |
| Mixtral 8x22B | 264 GB | 132 GB | 66 GB | 4 |
| LoRA adapter (rank 64, 70B) | 0.7 GB | 0.35 GB | — | 0 (shared GPU) |
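The table's numbers follow directly from parameter count × bytes per parameter. A small sketch — note the FP16 405B row needs more than one 8-GPU node, and practical tensor-parallel degrees round up to a power of two:

```python
import math

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params_b, precision):
    """Weight memory in GB for a model with num_params_b billion parameters."""
    return num_params_b * BYTES_PER_PARAM[precision]

def min_gpus_for_weights(weight_gb, gpu_memory_gb=80):
    """Lower bound on GPU count by weight memory alone (ignores KV cache)."""
    return math.ceil(weight_gb / gpu_memory_gb)

print(weight_memory_gb(70, "fp16"))  # 140.0
print(min_gpus_for_weights(140))     # 2
print(min_gpus_for_weights(810))     # 11 -> in practice TP=16 across two nodes
```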
Temporal Sharing: Model Switching on a GPU
Temporal sharing means one GPU runs different models at different times. When traffic shifts from Model A to Model B, we unload A and load B.
Cold Switch: Load from Disk
The simplest approach: store model weights on NVMe SSD, load them into GPU memory when needed.
import torch
import time
def cold_switch(model_path, device="cuda:0"):
"""
Load model weights from disk to GPU.
This is the slowest switching method.
"""
start = time.perf_counter()
# Step 1: Read from NVMe SSD to CPU pinned memory
# NVMe throughput: ~7 GB/s (Gen4 x4)
state_dict = torch.load(
model_path,
map_location="cpu",
weights_only=True,
)
cpu_load_time = time.perf_counter() - start
# Step 2: Transfer from CPU to GPU
# PCIe 5.0 x16: ~63 GB/s theoretical, ~50 GB/s practical
# For 140 GB (70B FP16): ~2.8 seconds
mid = time.perf_counter()
for key in state_dict:
state_dict[key] = state_dict[key].to(device, non_blocking=True)
torch.cuda.synchronize()
gpu_transfer_time = time.perf_counter() - mid
total_time = time.perf_counter() - start
return state_dict, {
"cpu_load_time": cpu_load_time,
"gpu_transfer_time": gpu_transfer_time,
"total_time": total_time,
}
Cold switch latencies for Llama 70B FP16 (140 GB):
- Disk read (140 GB at ~7 GB/s): ~20 s
- CPU-to-GPU transfer (140 GB at ~50 GB/s): ~2.8 s
- Total: ~23 s
During these 23 seconds, the GPU serves zero requests. This is unacceptable for interactive workloads.
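The 23-second figure decomposes as disk read plus PCIe copy; a back-of-envelope estimator under the bandwidth assumptions above:

```python
def cold_switch_estimate(weight_gb, nvme_gbps=7.0, pcie_gbps=50.0):
    """Estimated cold-switch downtime in seconds: NVMe read + host-to-device copy."""
    disk_s = weight_gb / nvme_gbps
    h2d_s = weight_gb / pcie_gbps
    return disk_s, h2d_s, disk_s + h2d_s

disk_s, h2d_s, total_s = cold_switch_estimate(140)
print(f"{disk_s:.1f}s disk + {h2d_s:.1f}s PCIe = {total_s:.1f}s")  # 20.0s disk + 2.8s PCIe = 22.8s
```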
Warm Switch: Keep Models in CPU Memory
Keep recently-used models in CPU DRAM. When switching, transfer from CPU to GPU only (skip the disk read):
class WarmModelCache:
"""
Cache model weights in CPU pinned memory for fast GPU loading.
Uses LRU eviction when CPU memory is full.
"""
def __init__(self, max_cpu_memory_gb=500):
self.max_memory = max_cpu_memory_gb * (1024 ** 3)
self.used_memory = 0
self.cache = {} # model_id -> state_dict (CPU pinned)
self.access_order = [] # LRU tracking
def load_to_cpu(self, model_id, model_path):
"""Load model weights into CPU pinned memory."""
if model_id in self.cache:
self._touch(model_id)
return
# Load from disk
state_dict = torch.load(model_path, map_location="cpu", weights_only=True)
# Convert to pinned memory for faster H2D transfer
pinned_dict = {}
model_size = 0
for key, tensor in state_dict.items():
pinned = torch.empty_like(tensor, pin_memory=True)
pinned.copy_(tensor)
pinned_dict[key] = pinned
model_size += tensor.nbytes
# Evict LRU models if necessary
while self.used_memory + model_size > self.max_memory and self.access_order:
self._evict_lru()
self.cache[model_id] = pinned_dict
self.used_memory += model_size
self.access_order.append(model_id)
def switch_to_gpu(self, model_id, device="cuda:0"):
"""Transfer cached model from CPU pinned memory to GPU."""
if model_id not in self.cache:
raise KeyError(f"Model {model_id} not in CPU cache")
self._touch(model_id)
state_dict = self.cache[model_id]
start = time.perf_counter()
# Pinned memory H2D transfer is faster than pageable
# PCIe 5.0 with pinned memory: ~55 GB/s
gpu_dict = {}
for key, tensor in state_dict.items():
gpu_dict[key] = tensor.to(device, non_blocking=True)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
return gpu_dict, elapsed
def _touch(self, model_id):
if model_id in self.access_order:
self.access_order.remove(model_id)
self.access_order.append(model_id)
def _evict_lru(self):
if not self.access_order:
return
victim_id = self.access_order.pop(0)
victim = self.cache.pop(victim_id)
for tensor in victim.values():
self.used_memory -= tensor.nbytes
del victim
Warm switch latency for Llama 70B FP16: 140 GB over PCIe 5.0 with pinned memory at ~55 GB/s is about 2.5 seconds. Better than cold, but still 2.5 seconds of downtime per switch.
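Whether the DRAM cache pays for itself depends on switch frequency. A sketch of the tradeoff, assuming every switch hits the CPU cache:

```python
def warm_cache_value(cache_gb, model_gb, cold_s, warm_s, switches_per_hour):
    """How many models fit in the CPU cache, and downtime saved per hour."""
    models_cached = int(cache_gb // model_gb)
    saved_s_per_hour = (cold_s - warm_s) * switches_per_hour
    return models_cached, saved_s_per_hour

models, saved = warm_cache_value(500, 140, cold_s=23.0, warm_s=2.5, switches_per_hour=10)
print(models, saved)  # 3 models resident, 205.0 s/hour of downtime avoided
```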
ModelExpress: Overlapped Transfer
The key optimization: overlap model weight transfer with serving. While the GPU is running the current model, asynchronously transfer the next model’s weights:
class ModelExpressLoader:
"""
Overlapped model loading: transfer weights while GPU is busy serving.
Achieves near-zero-downtime model switches.
"""
def __init__(self, device="cuda:0"):
self.device = device
self.current_model = None
self.next_model_buffer = None
self.transfer_stream = torch.cuda.Stream()
def start_preload(self, state_dict_cpu):
"""
Begin async transfer of next model while current model is serving.
Call this when you predict you will need to switch soon.
"""
# Allocate GPU buffer for the new model
self.next_model_buffer = {}
        with torch.cuda.stream(self.transfer_stream):
            for key, cpu_tensor in state_dict_cpu.items():
                # non_blocking copies overlap with compute only when the
                # source tensor is in pinned (page-locked) CPU memory
                gpu_tensor = torch.empty_like(cpu_tensor, device=self.device)
                gpu_tensor.copy_(cpu_tensor, non_blocking=True)
                self.next_model_buffer[key] = gpu_tensor
def is_preload_complete(self):
"""Check if async transfer is done."""
return self.transfer_stream.query()
def complete_switch(self):
"""
Finalize the model switch. Must be called after preload completes.
The actual "downtime" is just the model initialization (not transfer).
"""
# Wait for transfer to finish (should already be done)
self.transfer_stream.synchronize()
# Free old model memory
if self.current_model is not None:
del self.current_model
torch.cuda.empty_cache()
# The new model weights are already on GPU
self.current_model = self.next_model_buffer
self.next_model_buffer = None
# Model initialization (creating nn.Module, loading state_dict)
# This takes ~200ms regardless of model size
return self.current_model
With overlapped transfer, the effective switch time is:

T_switch = max(0, T_transfer − T_overlap) + T_init

where T_transfer is the weight-copy time, T_overlap is how long the preload runs concurrently with serving, and T_init is model initialization (~200 ms). If we start preloading early enough that T_overlap ≥ T_transfer, the switch time is just T_init.
During the overlap period, both the current model and the next model’s weights are in GPU memory simultaneously. For Llama 70B FP16, this requires 280 GB of GPU memory for weights alone. On H100 80GB GPUs, this means the technique only works if you are using model parallelism across enough GPUs to have headroom. For a 70B model on 4 GPUs (35 GB weights per GPU), overlapped loading needs 70 GB per GPU — feasible with 80 GB GPUs if KV cache is small.
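The feasibility condition and the required preload lead time can be sketched as follows; the PCIe bandwidth and ~200 ms init cost are the assumptions from above:

```python
def preload_plan(incoming_gb_per_gpu, free_gb_per_gpu, pcie_gbps=55.0, init_s=0.2):
    """If the next model's shard fits beside the current one, when to start preloading."""
    if incoming_gb_per_gpu > free_gb_per_gpu:
        return None  # no headroom: cannot overlap, fall back to warm switch
    transfer_s = incoming_gb_per_gpu / pcie_gbps
    return {"start_preload_s_before_switch": transfer_s, "downtime_s": init_s}

# 70B FP16 at TP=4: 35 GB/GPU incoming, assume 40 GB/GPU currently free
plan = preload_plan(35, free_gb_per_gpu=40)
print(plan)  # start preload ~0.64 s before the switch; downtime ~0.2 s
```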
Model switch latency by strategy (Llama 70B FP16, seconds): cold ~23 s, warm ~2.5 s, overlapped ~0.2 s.

Spatial Sharing: Two Models on One GPU
Instead of switching models over time, run multiple models simultaneously on the same GPU.
Memory Partitioning
The simplest form: allocate fixed memory fractions to each model.
import torch
def partition_gpu_memory(models, gpu_memory_gb=80):
"""
Partition GPU memory across multiple models.
Each model gets a fraction proportional to its weight size.
"""
total_weight_size = sum(m["weight_size_gb"] for m in models)
kv_cache_fraction = 0.45 # reserve 45% for KV cache
weight_memory = gpu_memory_gb * (1 - kv_cache_fraction)
if total_weight_size > weight_memory:
return None # models do not fit
allocations = []
remaining_kv_memory = gpu_memory_gb * kv_cache_fraction
for model in models:
weight_fraction = model["weight_size_gb"] / total_weight_size
kv_allocation = remaining_kv_memory * weight_fraction
allocations.append({
"model_id": model["id"],
"weight_memory_gb": model["weight_size_gb"],
"kv_cache_memory_gb": kv_allocation,
"max_concurrent_requests": int(
kv_allocation * 1024 / model["kv_per_request_mb"]
),
})
return allocations
# Example: two models on one H100 80GB
models = [
{
"id": "llama-8b-fp8",
"weight_size_gb": 8,
        "kv_per_request_mb": 2,  # illustrative per-request KV budget
},
{
"id": "mistral-7b-fp8",
"weight_size_gb": 7,
"kv_per_request_mb": 1.5,
},
]
allocs = partition_gpu_memory(models, gpu_memory_gb=80)
# llama-8b: 8 GB weights, 19.2 GB KV -> ~9800 concurrent requests
# mistral-7b: 7 GB weights, 16.8 GB KV -> ~11500 concurrent requests
NVIDIA MPS for SM Sharing
CUDA Multi-Process Service (MPS) allows multiple processes to share GPU SMs simultaneously, rather than time-slicing:
# Start MPS daemon
nvidia-cuda-mps-control -d
# Set SM partitioning (optional). Note: set_active_thread_percentage takes a
# *server PID*, not a model index; per-client caps are usually set via the
# CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable, as in the launcher below.
echo "set_default_active_thread_percentage 60" | nvidia-cuda-mps-control
import os
import subprocess
class MPSModelServer:
"""
Run two models on one GPU using MPS for SM sharing.
Each model runs in a separate process.
"""
def __init__(self, gpu_id=0):
self.gpu_id = gpu_id
self.processes = {}
def start_mps(self):
"""Initialize MPS daemon for the GPU."""
env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = str(self.gpu_id)
env["CUDA_MPS_PIPE_DIRECTORY"] = f"/tmp/mps_pipe_{self.gpu_id}"
env["CUDA_MPS_LOG_DIRECTORY"] = f"/tmp/mps_log_{self.gpu_id}"
os.makedirs(env["CUDA_MPS_PIPE_DIRECTORY"], exist_ok=True)
os.makedirs(env["CUDA_MPS_LOG_DIRECTORY"], exist_ok=True)
subprocess.Popen(
["nvidia-cuda-mps-control", "-d"],
env=env,
)
def launch_model(self, model_id, model_path, sm_percentage):
"""Launch a model server process with an SM allocation."""
env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = str(self.gpu_id)
env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percentage)
# Each model server is an independent process
proc = subprocess.Popen(
[
"python", "model_server.py",
"--model", model_path,
"--model-id", model_id,
],
env=env,
)
self.processes[model_id] = proc
Spatial Sharing Limitations
Spatial sharing has fundamental constraints:
1. Memory isolation. CUDA processes share GPU memory but there is no hardware memory protection between them. An out-of-memory allocation in one process can crash both. Production systems must carefully pre-allocate and limit memory per process.
2. SM interference. Even with MPS, two models share the L2 cache, memory controllers, and HBM bandwidth. A compute-heavy model will not slow the other model’s compute, but both compete for memory bandwidth.
3. Error isolation. A CUDA error (e.g., illegal memory access) in one process brings down the entire GPU context, killing all co-located models.
# Memory bandwidth contention measurement
def measure_bandwidth_contention():
"""
When two models share a GPU, each gets less memory bandwidth.
Memory-bound operations (decode) slow down.
Compute-bound operations (large-batch prefill) are less affected.
"""
# Bandwidth contention model:
# Single model: BW_eff = BW_peak * utilization
# Two models: BW_eff_per_model = BW_peak * utilization / contention_factor
#
# Measured contention factors on H100:
contention = {
"both_memory_bound": 1.8, # each gets ~55% of peak BW
"one_memory_one_compute": 1.2, # memory-bound gets ~83%
"both_compute_bound": 1.05, # minimal contention
}
return contention
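Applying these factors gives a rough per-model throughput estimate, which lands within a few percent of the decode row in the measurements below:

```python
CONTENTION_FACTORS = {
    "both_memory_bound": 1.8,        # each model gets ~55% of peak HBM bandwidth
    "one_memory_one_compute": 1.2,   # the memory-bound model gets ~83%
    "both_compute_bound": 1.05,      # minimal contention
}

def colocated_throughput(solo_tok_s, workload_mix):
    """Estimated throughput of one model when co-located, given the workload mix."""
    return solo_tok_s / CONTENTION_FACTORS[workload_mix]

print(round(colocated_throughput(3000, "both_memory_bound")))   # ~1667 tok/s
print(round(colocated_throughput(3000, "both_compute_bound")))  # ~2857 tok/s
```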
Spatial Sharing Impact on Per-Model Performance (H100 80GB)
| Configuration | Model A Throughput | Model B Throughput | Total Throughput | vs Dedicated |
|---|---|---|---|---|
| Dedicated (A only) | 3000 tok/s | — | 3000 tok/s | 1.00x |
| Dedicated (B only) | — | 3200 tok/s | 3200 tok/s | 1.00x |
| Spatial 50/50 (decode) | 1700 tok/s | 1800 tok/s | 3500 tok/s | 0.56x each |
| Spatial 50/50 (prefill) | 2600 tok/s | 2800 tok/s | 5400 tok/s | 0.87x each |
| Spatial 70/30 (A heavy) | 2400 tok/s | 900 tok/s | 3300 tok/s | varies |
Spatial sharing works well in controlled benchmarks but is difficult to operate in production. The lack of memory isolation means any model can OOM and crash its neighbor. The performance interference is workload-dependent and hard to predict. Most production systems prefer temporal sharing (model switching) or adapter pools over spatial sharing. MPS-based spatial sharing is primarily used when two models have complementary access patterns (one compute-bound, one memory-bound) and the operator accepts the operational risk.
Adapter Pools: Base Model + N LoRA Adapters
The most memory-efficient multi-model strategy: load one base model and swap lightweight LoRA adapters for each customer or task.
LoRA Memory Math
A LoRA adapter modifies a weight matrix W with a low-rank update:

W' = W + B A

where B ∈ R^(d_out × r) and A ∈ R^(r × d_in), with rank r ≪ min(d_in, d_out).

Memory per adapter: r × (d_in + d_out) parameters per adapted projection, times the number of projections, layers, and bytes per parameter.

For Llama 70B with rank 64, adapting all attention projections (Q, K, V, O per layer, 80 layers, treating each as 8192 × 8192):
- Per layer: 4 projections × 64 × (8192 + 8192) × 2 bytes ≈ 8.4 MB
- Total: 80 × 8.4 MB ≈ 0.67 GB
Compare to the base model: 140 GB. The adapter is 0.48% of the base model size. You can store 100 adapters in 67 GB — less than one copy of the base model.
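The adapter-size arithmetic as a helper. This assumes square d × d projections; GQA models have smaller K/V projections, so it slightly overestimates:

```python
def lora_adapter_gb(hidden_dim=8192, rank=64, num_layers=80,
                    num_projections=4, bytes_per_param=2):
    """LoRA adapter weight size in GB (decimal): A (r x d) and B (d x r) per projection."""
    params_per_proj = rank * (hidden_dim + hidden_dim)
    total_params = params_per_proj * num_projections * num_layers
    return total_params * bytes_per_param / 1e9

print(round(lora_adapter_gb(), 2))       # ~0.67 GB for rank 64 on a 70B base
print(round(100 * lora_adapter_gb()))    # 100 adapters ~= 67 GB
```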
S-LoRA: Scalable LoRA Serving
S-LoRA (Sheng et al., 2023) addresses the key challenge of serving many LoRA adapters concurrently: different requests in the same batch may require different adapters.
class SLoRAManager:
"""
S-LoRA adapter pool manager.
Handles adapter loading, batched LoRA computation, and adapter scheduling.
"""
def __init__(self, base_model, max_adapters_gpu=20, max_adapters_cpu=200):
self.base_model = base_model
self.hidden_dim = base_model.config.hidden_size
self.num_layers = base_model.config.num_hidden_layers
# Adapter storage
self.gpu_adapters = {} # adapter_id -> {layer_id -> (A, B)} on GPU
self.cpu_adapters = {} # adapter_id -> {layer_id -> (A, B)} on CPU
self.adapter_metadata = {} # adapter_id -> {rank, alpha, size_bytes}
# LRU for GPU adapter cache
self.gpu_access_order = []
self.max_gpu_adapters = max_adapters_gpu
self.max_cpu_adapters = max_adapters_cpu
def load_adapter(self, adapter_id, adapter_path):
"""Load a LoRA adapter from disk into CPU cache."""
adapter_weights = torch.load(adapter_path, weights_only=True)
# Parse LoRA weights into per-layer (A, B) pairs
parsed = {}
rank = None
for key, tensor in adapter_weights.items():
# Expected format: layers.{i}.{proj}.lora_{a|b}.weight
parts = key.split(".")
layer_idx = int(parts[1])
proj_name = parts[2]
lora_part = parts[3] # "lora_a" or "lora_b"
if layer_idx not in parsed:
parsed[layer_idx] = {}
if proj_name not in parsed[layer_idx]:
parsed[layer_idx][proj_name] = {}
parsed[layer_idx][proj_name][lora_part] = tensor
if lora_part == "lora_a" and rank is None:
rank = tensor.shape[0]
self.cpu_adapters[adapter_id] = parsed
total_size = sum(t.nbytes for t in adapter_weights.values())
self.adapter_metadata[adapter_id] = {
"rank": rank,
"size_bytes": total_size,
}
def ensure_on_gpu(self, adapter_id):
"""Ensure adapter is on GPU, loading from CPU if necessary."""
if adapter_id in self.gpu_adapters:
self._touch_gpu(adapter_id)
return
# Evict LRU adapter from GPU if at capacity
while len(self.gpu_adapters) >= self.max_gpu_adapters:
self._evict_gpu_lru()
# Transfer from CPU to GPU
cpu_adapter = self.cpu_adapters[adapter_id]
gpu_adapter = {}
for layer_idx, projs in cpu_adapter.items():
gpu_adapter[layer_idx] = {}
for proj_name, parts in projs.items():
gpu_adapter[layer_idx][proj_name] = {
k: v.cuda(non_blocking=True)
for k, v in parts.items()
}
self.gpu_adapters[adapter_id] = gpu_adapter
self.gpu_access_order.append(adapter_id)
def _touch_gpu(self, adapter_id):
if adapter_id in self.gpu_access_order:
self.gpu_access_order.remove(adapter_id)
self.gpu_access_order.append(adapter_id)
def _evict_gpu_lru(self):
if not self.gpu_access_order:
return
victim = self.gpu_access_order.pop(0)
del self.gpu_adapters[victim]
torch.cuda.empty_cache()
def batched_lora_forward(
hidden_states,
base_weight,
adapter_assignments,
adapter_pool,
layer_idx,
proj_name,
):
"""
Forward pass with multiple LoRA adapters in one batch.
Args:
hidden_states: [total_tokens, hidden_dim]
base_weight: [hidden_dim, output_dim]
adapter_assignments: list of (start_idx, end_idx, adapter_id)
Each entry says tokens[start:end] use adapter_id
adapter_pool: SLoRAManager with adapters on GPU
layer_idx: which transformer layer
proj_name: which projection (q, k, v, o)
"""
# Step 1: Base model computation (same for all tokens)
base_output = hidden_states @ base_weight # [total_tokens, output_dim]
# Step 2: LoRA delta computation (per-adapter)
# Option A: Loop over adapters (simple but sequential)
for start_idx, end_idx, adapter_id in adapter_assignments:
if adapter_id is None:
continue # no adapter for these tokens
adapter = adapter_pool.gpu_adapters[adapter_id][layer_idx][proj_name]
A = adapter["lora_a"] # [rank, hidden_dim]
B = adapter["lora_b"] # [output_dim, rank]
# Tokens for this adapter
x = hidden_states[start_idx:end_idx] # [n_tokens, hidden_dim]
# LoRA computation: x @ A^T @ B^T
delta = x @ A.T @ B.T # [n_tokens, output_dim]
base_output[start_idx:end_idx] += delta
return base_output
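A CPU-sized sanity check (toy dimensions, random weights) that the per-segment loop is equivalent to applying each adapter's merged weight W + (BA)^T directly:

```python
import torch

torch.manual_seed(0)
hidden, out, rank, n_tok = 16, 12, 4, 6
x = torch.randn(n_tok, hidden)
W = torch.randn(hidden, out)

# Two toy adapters; tokens 0-2 use "a1", tokens 3-5 use "a2"
adapters = {
    aid: {"lora_a": torch.randn(rank, hidden), "lora_b": torch.randn(out, rank)}
    for aid in ("a1", "a2")
}
assignments = [(0, 3, "a1"), (3, 6, "a2")]

# Loop form, as in batched_lora_forward above
y = x @ W
for s, e, aid in assignments:
    A, B = adapters[aid]["lora_a"], adapters[aid]["lora_b"]
    y[s:e] += x[s:e] @ A.T @ B.T

# Reference: merge each adapter into the base weight, then a plain matmul
ref = torch.empty_like(y)
for s, e, aid in assignments:
    A, B = adapters[aid]["lora_a"], adapters[aid]["lora_b"]
    ref[s:e] = x[s:e] @ (W + (B @ A).T)

assert torch.allclose(y, ref, atol=1e-5)
```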
SGMV: Custom Kernels for Batched LoRA
The per-adapter loop in batched_lora_forward is inefficient because each LoRA computation is a small GEMM that underutilizes the GPU. S-LoRA uses a custom CUDA kernel called SGMV (Segmented Gather Matrix-Vector) to batch all LoRA computations into one kernel:
def sgmv_batched_lora(
hidden_states,
lora_a_weights, # list of A matrices, one per unique adapter
lora_b_weights, # list of B matrices, one per unique adapter
segment_starts, # start index of each adapter segment
segment_lengths, # number of tokens per segment
adapter_indices, # which adapter each segment uses
):
"""
SGMV kernel: compute all LoRA deltas in one kernel launch.
Instead of N separate small GEMMs (one per adapter), we launch one
kernel that handles all adapters. The kernel:
1. Each thread block processes one segment
2. Loads the appropriate A and B matrices based on adapter_index
3. Computes the LoRA delta for that segment
4. Writes result to the correct output location
This is ~5x faster than the loop approach for 10+ concurrent adapters.
"""
    # In practice, this calls a custom CUDA kernel, e.g. Punica's SGMV ops:
    # punica.ops.sgmv(hidden_states, lora_a_weights, lora_b_weights,
    #                 segment_starts, segment_lengths, adapter_indices)
    raise NotImplementedError("requires a compiled SGMV kernel (e.g. Punica)")
LoRA Serving Performance: Loop vs SGMV (H100, Llama 70B base, 20 adapters)
| Method | Throughput (tok/s) | LoRA Overhead vs Base | Max Concurrent Adapters |
|---|---|---|---|
| Base model only (no LoRA) | 3000 | 0% | 0 |
| Naive loop (sequential GEMMs) | 2100 | 30% | 5-10 |
| SGMV batched kernel | 2750 | 8% | 50-100 |
| Merged adapters (offline) | 2950 | 2% | 1 (premerged) |
Adapter Scheduling
With hundreds of adapters and requests arriving in real time, the scheduler must decide which adapters to keep on GPU:
class AdapterScheduler:
"""
Schedule which LoRA adapters to keep on GPU based on traffic patterns.
"""
def __init__(self, adapter_pool, max_gpu_adapters=20):
self.adapter_pool = adapter_pool
self.max_gpu_adapters = max_gpu_adapters
self.request_counts = {} # adapter_id -> recent request count
self.window_size = 100 # count requests in last 100 iterations
def update_traffic(self, batch_adapter_ids):
"""Update traffic statistics from the current batch."""
for adapter_id in batch_adapter_ids:
self.request_counts[adapter_id] = (
self.request_counts.get(adapter_id, 0) + 1
)
def select_gpu_adapters(self, pending_requests):
"""
Decide which adapters should be on GPU for the next iteration.
Strategy: prioritize adapters needed by pending requests,
then keep popular adapters.
"""
# Must-have: adapters needed by currently running requests
needed_now = set()
for req in pending_requests:
if req.adapter_id is not None:
needed_now.add(req.adapter_id)
# Nice-to-have: popular adapters likely to be needed soon
popular = sorted(
self.request_counts.keys(),
key=lambda a: self.request_counts[a],
reverse=True,
)
selected = list(needed_now)
remaining_slots = self.max_gpu_adapters - len(selected)
for adapter_id in popular:
if remaining_slots <= 0:
break
if adapter_id not in needed_now:
selected.append(adapter_id)
remaining_slots -= 1
# Ensure selected adapters are on GPU
for adapter_id in selected:
self.adapter_pool.ensure_on_gpu(adapter_id)
return selected
If the system can predict which adapters will be needed (e.g., based on user session data or API key mapping), it can prefetch adapters to GPU before requests arrive. This eliminates the 15ms adapter swap latency entirely. In practice, most adapter accesses follow a Zipf distribution — a small number of popular adapters handle most traffic, and they stay hot in the GPU cache permanently.
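A small simulation of this effect — the Zipf exponent and pool sizes are assumed values — using an LRU cache over adapter requests:

```python
import random

def lru_hit_rate(num_adapters=200, gpu_slots=20, zipf_s=1.1, n_requests=20000, seed=0):
    """Hit rate of an LRU adapter cache under Zipf-distributed adapter popularity."""
    rng = random.Random(seed)
    weights = [1.0 / (i + 1) ** zipf_s for i in range(num_adapters)]
    cache, hits = [], 0
    for _ in range(n_requests):
        a = rng.choices(range(num_adapters), weights=weights)[0]
        if a in cache:
            hits += 1
            cache.remove(a)  # move to MRU position
        elif len(cache) >= gpu_slots:
            cache.pop(0)     # evict LRU
        cache.append(a)
    return hits / n_requests

rate = lru_hit_rate()
# With only 10% of adapters resident, the hot head of the distribution keeps
# the hit rate far above the ~10% a uniform workload would give
```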
Traffic-Based Routing
With multiple GPUs each potentially hosting different models or adapters, the router must send each request to a GPU that already has the right model loaded.
Affinity-Based Routing
import hashlib
from collections import defaultdict
class ModelAffinityRouter:
"""
Route requests to GPUs based on model affinity.
Goal: minimize model switches by sending requests for the same
model to the same GPU.
"""
def __init__(self, gpu_ids):
self.gpu_ids = gpu_ids
self.gpu_models = {} # gpu_id -> currently loaded model_id
self.gpu_adapters = defaultdict(set) # gpu_id -> set of loaded adapter_ids
self.gpu_queue_depth = defaultdict(int) # gpu_id -> pending requests
self.model_to_gpus = defaultdict(set) # model_id -> set of gpu_ids
def route_request(self, model_id, adapter_id=None):
"""
Select the best GPU for this request.
Priority:
1. GPU with correct model AND adapter already loaded (zero switch cost)
2. GPU with correct model loaded (adapter swap: ~15ms)
3. GPU with model in warm cache (warm switch: ~2.5s)
4. Least loaded GPU (cold switch: ~23s)
"""
# Priority 1: exact match (model + adapter)
if adapter_id:
exact_matches = [
gpu_id for gpu_id in self.gpu_ids
if (self.gpu_models.get(gpu_id) == model_id
and adapter_id in self.gpu_adapters[gpu_id])
]
if exact_matches:
return self._select_least_loaded(exact_matches)
# Priority 2: model match (adapter not loaded but model is)
model_matches = [
gpu_id for gpu_id in self.gpu_ids
if self.gpu_models.get(gpu_id) == model_id
]
if model_matches:
return self._select_least_loaded(model_matches)
# Priority 3: any GPU with warm cache for this model
warm_matches = [
gpu_id for gpu_id in self.gpu_ids
if self._has_warm_cache(gpu_id, model_id)
]
if warm_matches:
return self._select_least_loaded(warm_matches)
# Priority 4: least loaded GPU (will cold switch)
return self._select_least_loaded(self.gpu_ids)
def _select_least_loaded(self, candidates):
"""Among candidate GPUs, pick the one with fewest pending requests."""
return min(candidates, key=lambda g: self.gpu_queue_depth[g])
def _has_warm_cache(self, gpu_id, model_id):
"""Check if GPU has model weights in CPU memory (warm cache)."""
# Implementation depends on warm cache integration
return False # placeholder
def update_gpu_state(self, gpu_id, model_id, adapter_ids):
"""Called by GPU workers to report their current state."""
old_model = self.gpu_models.get(gpu_id)
if old_model and old_model != model_id:
self.model_to_gpus[old_model].discard(gpu_id)
self.gpu_models[gpu_id] = model_id
self.gpu_adapters[gpu_id] = set(adapter_ids)
self.model_to_gpus[model_id].add(gpu_id)
Consistent Hashing for Sticky Routing
For stateless routing (where the router does not track GPU state), consistent hashing provides model-GPU affinity:
import bisect
import hashlib
class ConsistentHashRouter:
"""
Consistent hashing for model-to-GPU routing.
Models with the same ID always route to the same GPU subset.
Adding/removing GPUs minimally disrupts routing.
"""
def __init__(self, gpu_ids, virtual_nodes=150):
self.ring = [] # sorted list of (hash, gpu_id)
self.gpu_ids = set(gpu_ids)
for gpu_id in gpu_ids:
for i in range(virtual_nodes):
h = self._hash(f"{gpu_id}:{i}")
self.ring.append((h, gpu_id))
self.ring.sort(key=lambda x: x[0])
self.ring_hashes = [h for h, _ in self.ring]
def _hash(self, key):
return int(hashlib.md5(key.encode()).hexdigest(), 16)
def route(self, model_id, num_replicas=1):
"""
Find GPUs for a model.
Returns num_replicas GPU IDs for load balancing.
"""
h = self._hash(model_id)
idx = bisect.bisect_left(self.ring_hashes, h) % len(self.ring)
selected = []
seen = set()
while len(selected) < num_replicas:
_, gpu_id = self.ring[idx % len(self.ring)]
if gpu_id not in seen:
selected.append(gpu_id)
seen.add(gpu_id)
idx += 1
return selected
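The "minimal disruption" property is easy to verify with a stripped-down ring (a hypothetical helper, not the class above): adding a ninth GPU should remap only about 1/9 of models.

```python
import bisect
import hashlib

def build_ring(gpu_ids, vnodes=150):
    """Build a sorted hash ring of (hash, gpu_id) virtual nodes."""
    ring = sorted(
        (int(hashlib.md5(f"{g}:{i}".encode()).hexdigest(), 16), g)
        for g in gpu_ids
        for i in range(vnodes)
    )
    return ring, [h for h, _ in ring]

def lookup(ring, hashes, model_id):
    """Map a model to the first virtual node at or after its hash."""
    h = int(hashlib.md5(model_id.encode()).hexdigest(), 16)
    return ring[bisect.bisect_left(hashes, h) % len(ring)][1]

models = [f"model-{i}" for i in range(1000)]
ring8, hashes8 = build_ring(range(8))
ring9, hashes9 = build_ring(range(9))
moved = sum(lookup(ring8, hashes8, m) != lookup(ring9, hashes9, m) for m in models)
print(moved / len(models))  # roughly 1/9 of models remap
```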
Request Routing Hit Rate (model already loaded on target GPU, %)

Implementation: Model Registry with LRU Cache and Adapter Pool
Here is a complete model registry that ties together model switching, adapter management, and routing:
import time
import threading
from collections import OrderedDict
from dataclasses import dataclass, field
from enum import Enum

import torch
class ModelState(Enum):
ON_GPU = "on_gpu"
IN_CPU_CACHE = "in_cpu_cache"
ON_DISK = "on_disk"
LOADING = "loading"
@dataclass
class ModelEntry:
model_id: str
model_path: str
weight_size_bytes: int
dtype: str = "float16"
# Runtime state
state: ModelState = ModelState.ON_DISK
gpu_id: int = -1
last_access_time: float = 0.0
request_count: int = 0
cpu_state_dict: dict = field(default=None, repr=False)
gpu_state_dict: dict = field(default=None, repr=False)
@dataclass
class AdapterEntry:
adapter_id: str
base_model_id: str
adapter_path: str
rank: int
size_bytes: int
# Runtime
on_gpu: bool = False
gpu_id: int = -1
cpu_weights: dict = field(default=None, repr=False)
gpu_weights: dict = field(default=None, repr=False)
last_access_time: float = 0.0
request_count: int = 0
class ModelRegistry:
"""
Central registry for multi-model serving.
Manages model lifecycle, adapter pools, and GPU allocation.
"""
def __init__(
self,
gpu_ids,
gpu_memory_gb=80,
cpu_cache_gb=500,
max_adapters_per_gpu=20,
):
self.gpu_ids = gpu_ids
self.gpu_memory_gb = gpu_memory_gb
self.cpu_cache_gb = cpu_cache_gb
self.max_adapters_per_gpu = max_adapters_per_gpu
# Model registry
self.models = {} # model_id -> ModelEntry
# Adapter registry
self.adapters = {} # adapter_id -> AdapterEntry
# GPU state
self.gpu_loaded_model = {} # gpu_id -> model_id
self.gpu_loaded_adapters = { # gpu_id -> OrderedDict (LRU)
gpu_id: OrderedDict() for gpu_id in gpu_ids
}
self.gpu_memory_used = {gpu_id: 0 for gpu_id in gpu_ids}
# CPU LRU cache
self.cpu_cache = OrderedDict() # model_id -> state_dict
self.cpu_cache_used = 0
# Lock for thread safety
self.lock = threading.Lock()
# Metrics
self.switch_count = 0
self.cache_hits = 0
self.cache_misses = 0
def register_model(self, model_id, model_path, weight_size_gb, dtype="float16"):
"""Register a model in the catalog."""
self.models[model_id] = ModelEntry(
model_id=model_id,
model_path=model_path,
weight_size_bytes=int(weight_size_gb * 1024**3),
dtype=dtype,
)
def register_adapter(self, adapter_id, base_model_id, adapter_path, rank):
"""Register a LoRA adapter."""
# Estimate adapter size
base = self.models[base_model_id]
# Rough estimate: 2 * hidden_dim * rank * num_layers * num_projections * dtype_size
hidden_dim = 8192 if "70b" in base_model_id.lower() else 4096
num_layers = 80 if "70b" in base_model_id.lower() else 32
num_projections = 4 # Q, K, V, O
dtype_size = 2 if base.dtype == "float16" else 1
size = 2 * hidden_dim * rank * num_layers * num_projections * dtype_size
self.adapters[adapter_id] = AdapterEntry(
adapter_id=adapter_id,
base_model_id=base_model_id,
adapter_path=adapter_path,
rank=rank,
size_bytes=size,
)
def get_model_for_request(self, model_id, adapter_id, gpu_id):
"""
Ensure model and adapter are loaded on the specified GPU.
Returns estimated switch latency.
"""
with self.lock:
switch_latency_ms = 0
# Step 1: Ensure base model is loaded
if self.gpu_loaded_model.get(gpu_id) != model_id:
switch_latency_ms = self._switch_model(model_id, gpu_id)
# Step 2: Ensure adapter is loaded (if requested)
if adapter_id:
adapter_latency = self._load_adapter(adapter_id, gpu_id)
switch_latency_ms += adapter_latency
# Update metrics
model = self.models[model_id]
model.last_access_time = time.time()
model.request_count += 1
if adapter_id:
adapter = self.adapters[adapter_id]
adapter.last_access_time = time.time()
adapter.request_count += 1
return switch_latency_ms
def _switch_model(self, model_id, gpu_id):
"""Switch the model on a GPU. Returns latency in ms."""
model = self.models[model_id]
old_model_id = self.gpu_loaded_model.get(gpu_id)
# Save old model to CPU cache if present
if old_model_id:
self._save_to_cpu_cache(old_model_id, gpu_id)
# Check CPU cache for new model
if model_id in self.cpu_cache:
# Warm switch
self.cache_hits += 1
state_dict = self.cpu_cache[model_id]
latency_ms = self._transfer_cpu_to_gpu(state_dict, gpu_id)
self.gpu_loaded_model[gpu_id] = model_id
model.state = ModelState.ON_GPU
model.gpu_id = gpu_id
self.switch_count += 1
return latency_ms
else:
# Cold switch
self.cache_misses += 1
state_dict = self._load_from_disk(model.model_path)
latency_ms = self._transfer_cpu_to_gpu(state_dict, gpu_id)
# Also cache in CPU memory
self._add_to_cpu_cache(model_id, state_dict)
self.gpu_loaded_model[gpu_id] = model_id
model.state = ModelState.ON_GPU
model.gpu_id = gpu_id
self.switch_count += 1
return latency_ms + self._estimate_disk_load_ms(model.weight_size_bytes)
def _load_adapter(self, adapter_id, gpu_id):
"""Load an adapter onto GPU. Returns latency in ms."""
adapter_cache = self.gpu_loaded_adapters[gpu_id]
if adapter_id in adapter_cache:
# Already loaded, move to end (most recently used)
adapter_cache.move_to_end(adapter_id)
return 0.0 # no latency
# Need to load adapter
adapter = self.adapters[adapter_id]
# Evict LRU adapter if at capacity
while len(adapter_cache) >= self.max_adapters_per_gpu:
evicted_id, _ = adapter_cache.popitem(last=False)
evicted = self.adapters[evicted_id]
evicted.on_gpu = False
evicted.gpu_weights = None
# Load adapter weights
weights = torch.load(adapter.adapter_path, weights_only=True)
gpu_weights = {k: v.cuda(gpu_id) for k, v in weights.items()}
adapter_cache[adapter_id] = gpu_weights
adapter.on_gpu = True
adapter.gpu_id = gpu_id
adapter.gpu_weights = gpu_weights
# Adapter transfer: typically 0.4-1.5 GB at ~50 GB/s
latency_ms = (adapter.size_bytes / (50 * 1024**3)) * 1000
return latency_ms
def _save_to_cpu_cache(self, model_id, gpu_id):
"""Save GPU model weights to CPU cache."""
# In real implementation, copy weights from GPU to pinned CPU memory
# For now, just track the cache state
model = self.models[model_id]
model.state = ModelState.IN_CPU_CACHE
def _add_to_cpu_cache(self, model_id, state_dict):
"""Add model to CPU LRU cache, evicting if necessary."""
model = self.models[model_id]
size = model.weight_size_bytes
# Evict LRU entries until space is available
while (
self.cpu_cache_used + size > self.cpu_cache_gb * 1024**3
and self.cpu_cache
):
evicted_id, evicted_dict = self.cpu_cache.popitem(last=False)
evicted = self.models[evicted_id]
self.cpu_cache_used -= evicted.weight_size_bytes
evicted.state = ModelState.ON_DISK
del evicted_dict
self.cpu_cache[model_id] = state_dict
self.cpu_cache_used += size
model.state = ModelState.IN_CPU_CACHE
def _load_from_disk(self, model_path):
"""Load model weights from disk."""
return torch.load(model_path, map_location="cpu", weights_only=True)
def _transfer_cpu_to_gpu(self, state_dict, gpu_id):
"""Transfer weights from CPU to GPU. Returns estimated latency in ms."""
total_bytes = sum(t.nbytes for t in state_dict.values())
# PCIe 5.0 with pinned memory: ~55 GB/s
return (total_bytes / (55 * 1024**3)) * 1000
def _estimate_disk_load_ms(self, size_bytes):
"""Estimate disk load time."""
# NVMe Gen4 x4: ~7 GB/s
return (size_bytes / (7 * 1024**3)) * 1000
def get_stats(self):
"""Return registry statistics."""
return {
"total_models": len(self.models),
"total_adapters": len(self.adapters),
"models_on_gpu": sum(
1 for m in self.models.values() if m.state == ModelState.ON_GPU
),
"models_in_cpu_cache": sum(
1 for m in self.models.values() if m.state == ModelState.IN_CPU_CACHE
),
"adapters_on_gpu": sum(
len(adapters) for adapters in self.gpu_loaded_adapters.values()
),
"model_switches": self.switch_count,
"cache_hit_rate": (
self.cache_hits / (self.cache_hits + self.cache_misses)
if (self.cache_hits + self.cache_misses) > 0
else 0.0
),
"gpu_utilization": {
gpu_id: self.gpu_loaded_model.get(gpu_id, "empty")
for gpu_id in self.gpu_ids
},
}
# Usage example
def setup_multi_model_cluster():
registry = ModelRegistry(
gpu_ids=list(range(8)),
gpu_memory_gb=80,
cpu_cache_gb=500,
max_adapters_per_gpu=20,
)
# Register base models
registry.register_model("llama-8b", "/models/llama-8b", weight_size_gb=16)
registry.register_model("llama-70b", "/models/llama-70b", weight_size_gb=140)
registry.register_model("mistral-7b", "/models/mistral-7b", weight_size_gb=14)
# Register adapters
for i in range(100):
registry.register_adapter(
adapter_id=f"customer_{i}_adapter",
base_model_id="llama-70b",
adapter_path=f"/adapters/customer_{i}.pt",
rank=64,
)
# Simulate request routing
requests = [
("llama-70b", "customer_0_adapter", 0),
("llama-70b", "customer_1_adapter", 0),
("llama-8b", None, 1),
("llama-70b", "customer_0_adapter", 2),
("mistral-7b", None, 3),
]
for model_id, adapter_id, gpu_id in requests:
latency = registry.get_model_for_request(model_id, adapter_id, gpu_id)
print(f"Model={model_id}, Adapter={adapter_id}, GPU={gpu_id}, "
f"Switch latency={latency:.1f}ms")
print(registry.get_stats())
Cost Optimization: When to Use Each Strategy
The choice between temporal sharing, spatial sharing, and adapter pools depends on the traffic pattern:
Multi-Model Strategy Decision Matrix
| Strategy | Best When | Switch Cost | Memory Efficiency | Operational Complexity |
|---|---|---|---|---|
| Temporal (cold) | Rare model switches, batch jobs | 10-25s | Excellent (1 model per GPU) | Low |
| Temporal (warm) | Moderate switch frequency | 2-3s | Good (CPU memory needed) | Medium |
| Temporal (overlapped) | Predictable traffic patterns | 200ms | Poor (2x GPU memory during switch) | High |
| Spatial (MPS) | Two complementary models | 0 | Moderate (memory split) | High |
| Adapter pool | Many variants of one base | 15ms | Excellent (adapters are tiny) | Medium |
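The matrix above can be folded into a simple routing heuristic. The sketch below is illustrative: the function name and thresholds are assumptions, not tuned production values.

```python
def pick_strategy(same_base: bool, switches_per_hour: float,
                  traffic_predictable: bool) -> str:
    """Pick a multi-model serving strategy from the decision matrix.

    Thresholds are illustrative, not tuned values.
    """
    if same_base:
        # Variants of one base model: adapters swap in ~15 ms
        return "adapter_pool"
    if switches_per_hour < 2:
        # Rare switches tolerate a 10-25 s cold load
        return "temporal_cold"
    if traffic_predictable:
        # Prefetch the next model while the current one drains
        return "temporal_overlapped"
    if switches_per_hour < 30:
        # CPU cache keeps warm switches to 2-3 s
        return "temporal_warm"
    # Frequent, unpredictable switching between two models:
    # co-locate them under MPS if memory allows
    return "spatial_mps"
```

A router would call this per model pair, then fall back to temporal warm switching when MPS memory partitioning is not feasible.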
Cost Model
The total cost of serving models on GPUs is the compute cost plus a switch penalty:

cost_per_hour = gpu_hours × gpu_price_per_hour + switch_penalty

where the switch penalty is the token revenue foregone while a GPU sits idle mid-switch:

switch_penalty = switches_per_hour × switch_time_s × throughput_tok_per_s × price_per_token

For a cluster doing 100 model switches per hour with warm switching (2.5 s each), each switching GPU idles 250 seconds per hour. At 3000 tok/s and $0.003 per 1K tokens, that is 750K tokens of lost capacity: about $2.25 per GPU per hour, or $144/hour across a 64-GPU cluster. Cold switching at the same rate runs on the order of $11.50 per GPU per hour.
With adapter pools, 100 "switches" per hour cost:

100 × 0.015 s × 3000 tok/s × $0.003/1K tok ≈ $0.01/hour, effectively negligible.
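The switch-penalty arithmetic is worth sanity-checking in code. This sketch plugs in the example values used throughout this section (100 switches/hour, 3000 tok/s, $0.003 per 1K tokens):

```python
def switch_penalty_per_hour(switches_per_hour: float, switch_time_s: float,
                            throughput_tok_s: float,
                            price_per_1k_tok: float) -> float:
    """Revenue lost to model switches: idle seconds times foregone tokens."""
    idle_s = switches_per_hour * switch_time_s
    lost_tokens = idle_s * throughput_tok_s
    return lost_tokens / 1000 * price_per_1k_tok

# Warm switching: 2.5 s per switch -> $2.25 per GPU per hour
warm = switch_penalty_per_hour(100, 2.5, 3000, 0.003)
# Adapter pool: 15 ms per swap -> about a penny per hour
pool = switch_penalty_per_hour(100, 0.015, 3000, 0.003)
```

Multiplying the warm figure by 64 GPUs reproduces the $144/hour cluster-wide number.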
[Figure: Hourly Cost of Model Switching ($/hour), at 100 switches/hour and 3000 tok/s per GPU]
If your multi-model requirements can be served by LoRA adapters (same base model, task-specific customization), always prefer adapter pools. The memory overhead is negligible (about 0.5% per adapter), switch time is around 15 ms, and SGMV kernels make concurrent adapter serving efficient. Reserve temporal model switching for fundamentally different base models (e.g., switching between 8B and 70B, or between Llama and Mistral).
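The math behind concurrent adapter serving is small enough to show directly. This NumPy sketch applies a shared base weight plus a per-request rank-r update; real serving stacks fuse the per-request gather into a single batched SGMV kernel, and the shapes and function name here are illustrative:

```python
import numpy as np

def lora_forward_mixed_batch(x, W, adapters, adapter_ids, scale=1.0):
    """Shared base projection plus a per-request LoRA adapter.

    x: (batch, d_in) activations; W: (d_in, d_out) frozen base weight.
    adapters: {id: (A, B)} with A (d_in, r) and B (r, d_out).
    The per-request loop shows the math; SGMV kernels batch it.
    """
    y = x @ W  # one shared matmul for the whole batch
    for i, aid in enumerate(adapter_ids):
        if aid is not None:
            A, B = adapters[aid]
            # Rank-r update: compute (x A) first so the work is O(d * r)
            y[i] += scale * (x[i] @ A) @ B
    return y
```

At rank 64 each adapter adds only 2 x d x r parameters per layer, which is why hundreds of adapters fit alongside one base model.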
Monitoring and Autoscaling
A multi-model cluster needs monitoring to detect when to scale or rebalance:
import time
from collections import defaultdict
class MultiModelMonitor:
"""Monitor multi-model serving cluster health."""
def __init__(self, registry):
self.registry = registry
self.model_latencies = defaultdict(list) # model_id -> [latencies]
self.switch_events = [] # (timestamp, gpu_id, from, to)
self.adapter_swap_events = []
self.queue_depths = defaultdict(list) # model_id -> [queue_depths]
    def record_request(self, model_id, adapter_id, latency_ms, gpu_id):
        """Record a completed request."""
        self.model_latencies[model_id].append(latency_ms)
    def record_queue_depth(self, model_id, depth):
        """Record a queue-depth sample; read by get_scaling_recommendations."""
        self.queue_depths[model_id].append(depth)
def record_switch(self, gpu_id, from_model, to_model, switch_time_ms):
"""Record a model switch event."""
self.switch_events.append({
"timestamp": time.time(),
"gpu_id": gpu_id,
"from_model": from_model,
"to_model": to_model,
"switch_time_ms": switch_time_ms,
})
def get_scaling_recommendations(self):
"""
Analyze metrics and recommend scaling actions.
"""
recommendations = []
# Check switch frequency per GPU
now = time.time()
recent_switches = [
s for s in self.switch_events
if now - s["timestamp"] < 3600 # last hour
]
switch_rate_per_gpu = defaultdict(int)
for s in recent_switches:
switch_rate_per_gpu[s["gpu_id"]] += 1
for gpu_id, count in switch_rate_per_gpu.items():
if count > 50: # more than 50 switches/hour
model = self.registry.gpu_loaded_model.get(gpu_id)
recommendations.append({
"action": "add_dedicated_gpu",
"reason": f"GPU {gpu_id} switching {count}x/hour",
"target_model": model,
"estimated_savings_per_hour": count * 2.5 * 0.003,
})
# Check queue depth per model
for model_id, depths in self.queue_depths.items():
            if depths and sum(depths[-10:]) / len(depths[-10:]) > 50:
recommendations.append({
"action": "scale_up",
"reason": f"Model {model_id} avg queue depth above 50",
"target_model": model_id,
"suggested_additional_gpus": 1,
})
        # Check the model CPU-cache hit rate (cache_hits/cache_misses count
        # full model loads, not adapter swaps)
        stats = self.registry.get_stats()
if stats["cache_hit_rate"] < 0.7:
recommendations.append({
"action": "increase_cpu_cache",
"reason": f"Cache hit rate {stats['cache_hit_rate']:.1%} below 70%",
"suggested_increase_gb": 100,
})
return recommendations
Autoscaling Triggers
# Autoscaling policy (simplified):
autoscale_config = {
"scale_up_triggers": [
{
"metric": "p99_latency_ms",
"threshold": 200,
"window_minutes": 5,
"action": "add_gpu_for_model",
},
{
"metric": "switch_rate_per_gpu_per_hour",
"threshold": 30,
"window_minutes": 60,
"action": "add_dedicated_gpu",
},
{
"metric": "adapter_cache_miss_rate",
"threshold": 0.3,
"window_minutes": 15,
"action": "increase_adapter_cache",
},
],
"scale_down_triggers": [
{
"metric": "gpu_utilization_percent",
"threshold": 20,
"window_minutes": 30,
"action": "consolidate_models",
},
],
"cooldown_minutes": 10,
}
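Acting on such a policy needs only a small evaluation loop. This is a minimal sketch; windowed metric aggregation and cooldown tracking are assumed to happen in the caller, and `evaluate_triggers` is a hypothetical helper, not part of any autoscaler API:

```python
def evaluate_triggers(config: dict, metrics: dict) -> list:
    """Return the actions whose trigger metric crosses its threshold.

    config follows the autoscale_config shape above; metrics maps each
    metric name to its current windowed value.
    """
    actions = []
    for trigger in config["scale_up_triggers"]:
        # Scale up when a pressure metric exceeds its threshold
        if metrics.get(trigger["metric"], 0) > trigger["threshold"]:
            actions.append(trigger["action"])
    for trigger in config["scale_down_triggers"]:
        # Scale down when utilization falls below its threshold
        if metrics.get(trigger["metric"], float("inf")) < trigger["threshold"]:
            actions.append(trigger["action"])
    return actions
```

In practice the returned actions would be deduplicated per model and suppressed during the cooldown window before any GPUs are added or drained.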