Part of Series: Inference Optimization Timeline (25 of 60)

Real-world serving doesn’t mean loading one model and calling it a day. Your cluster needs to serve Llama 8B for latency-sensitive chat, Llama 70B for complex reasoning, Llama 405B for research tasks, Mistral 7B for code, plus 50 domain-specific fine-tunes and 200 per-customer LoRA adapters. A single 70B model consumes 140 GB just for weights — two full H100s before you even allocate KV cache. Do the math: 64 H100s × 80 GB = 5120 GB total, but your catalog of models needs 4740 GB of weight memory alone. The entire catalog cannot fit simultaneously — and whatever you do about that has to hold up while users expect 100 ms response times.

The problem: how do you serve more models than your GPUs can simultaneously hold? Three strategies exist: temporal sharing (switch models on a GPU over time), spatial sharing (run multiple models on one GPU simultaneously), and adapter pools (load one base model and swap lightweight LoRA adapters). Each has different latency, throughput, and memory tradeoffs. This post covers all three in implementation detail.

The Multi-Model Inventory Problem

A typical production cluster serves:

Base models:
  - Llama 3 8B     (16 GB FP16, 8 GB FP8)
  - Llama 3 70B    (140 GB FP16, 70 GB FP8)
  - Llama 3 405B   (810 GB FP16, 405 GB FP8)
  - Mistral 7B     (14 GB FP16)
  - Mixtral 8x22B  (264 GB FP16, MoE)

Fine-tuned variants per base:
  - Chat / Instruct
  - Code-specific
  - Domain fine-tunes (legal, medical, finance)
  - Safety-tuned variants
  Total: 5-20 variants per base model

LoRA adapters per base:
  - Per-customer adapters (rank 16-64)
  - Task-specific adapters (summarization, extraction, translation)
  - A/B testing variants
  Total: 50-500 adapters per base model

Memory inventory for a 64-GPU H100 cluster:

$$\text{Total GPU memory} = 64 \times 80\text{ GB} = 5120\text{ GB}$$

$$\text{Total model weight memory needed} \approx 140 \times 20 + 16 \times 20 + 810 \times 2 = 4740\text{ GB}$$

(20 fine-tuned variants of the 70B, 20 of the 8B, and 2 replicas of the 405B, all FP16.)

This does not even account for KV cache (typically 40-60% of GPU memory during serving). The entire model catalog cannot fit simultaneously.
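A quick sanity check of this arithmetic, with the illustrative catalog counts used above (20 variants each of the 8B and 70B bases, 2 replicas of the 405B):

```python
# Cluster capacity vs. catalog footprint, using the numbers from the text.
GPU_COUNT, GPU_MEM_GB = 64, 80
total_gpu_gb = GPU_COUNT * GPU_MEM_GB               # 5120 GB of HBM

# FP16 weight footprint: 20 variants of 70B (140 GB each),
# 20 variants of 8B (16 GB each), 2 replicas of 405B (810 GB each).
weight_need_gb = 140 * 20 + 16 * 20 + 810 * 2       # 4740 GB

# Reserve roughly half of each GPU for KV cache during serving.
usable_for_weights_gb = total_gpu_gb * 0.5          # 2560 GB

print(total_gpu_gb, weight_need_gb)                 # 5120 4740
print(weight_need_gb > usable_for_weights_gb)       # True: catalog cannot fit
```

Even ignoring the KV-cache reserve, 4740 GB of weights against 5120 GB of total HBM leaves almost nothing for serving state.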

📊

Model Weight Memory Requirements

| Model | FP16 Size | FP8 Size | INT4 Size | GPUs Needed (FP16, 80 GB) |
|---|---|---|---|---|
| Llama 3 8B | 16 GB | 8 GB | 4 GB | 1 |
| Llama 3 70B | 140 GB | 70 GB | 35 GB | 2 |
| Llama 3 405B | 810 GB | 405 GB | 203 GB | 8 (TP=8) |
| Mistral 7B | 14 GB | 7 GB | 3.5 GB | 1 |
| Mixtral 8x22B | 264 GB | 132 GB | 66 GB | 4 |
| LoRA adapter (rank 64, 70B base) | 0.8 GB | 0.4 GB | — | 0 (shares base GPU) |

Note: FP16 size = 2 bytes × parameter count; FP8 = 1 byte; INT4 = 0.5 bytes. GPU count is the minimum that holds the weights with headroom left for KV cache; the 405B row assumes FP8 weights, since 810 GB of FP16 exceeds 8 × 80 GB. LoRA adapters are much smaller than base models.

Temporal Sharing: Model Switching on a GPU

Temporal sharing means one GPU runs different models at different times. When traffic shifts from Model A to Model B, we unload A and load B.

Cold Switch: Load from Disk

The simplest approach: store model weights on NVMe SSD, load them into GPU memory when needed.

import torch
import time

def cold_switch(model_path, device="cuda:0"):
    """
    Load model weights from disk to GPU.
    This is the slowest switching method.
    """
    start = time.perf_counter()

    # Step 1: Read from NVMe SSD to CPU pinned memory
    # NVMe throughput: ~7 GB/s (Gen4 x4)
    state_dict = torch.load(
        model_path,
        map_location="cpu",
        weights_only=True,
    )

    cpu_load_time = time.perf_counter() - start

    # Step 2: Transfer from CPU to GPU
    # PCIe 5.0 x16: ~63 GB/s theoretical, ~50 GB/s practical
    # For 140 GB (70B FP16): ~2.8 seconds
    mid = time.perf_counter()

    for key in state_dict:
        state_dict[key] = state_dict[key].to(device, non_blocking=True)
    torch.cuda.synchronize()

    gpu_transfer_time = time.perf_counter() - mid
    total_time = time.perf_counter() - start

    return state_dict, {
        "cpu_load_time": cpu_load_time,
        "gpu_transfer_time": gpu_transfer_time,
        "total_time": total_time,
    }

Cold switch latencies:

$$t_{\text{cold}} = \frac{S_{\text{model}}}{\text{BW}_{\text{disk}}} + \frac{S_{\text{model}}}{\text{BW}_{\text{PCIe}}} + t_{\text{init}}$$

For Llama 70B FP16 (140 GB):

$$t_{\text{cold}} = \frac{140}{7} + \frac{140}{50} + 0.5 = 20.0 + 2.8 + 0.5 = 23.3\text{ seconds}$$

During these 23 seconds, the GPU serves zero requests. This is unacceptable for interactive workloads.

Warm Switch: Keep Models in CPU Memory

Keep recently-used models in CPU DRAM. When switching, transfer from CPU to GPU only (skip the disk read):

class WarmModelCache:
    """
    Cache model weights in CPU pinned memory for fast GPU loading.
    Uses LRU eviction when CPU memory is full.
    """

    def __init__(self, max_cpu_memory_gb=500):
        self.max_memory = max_cpu_memory_gb * (1024 ** 3)
        self.used_memory = 0
        self.cache = {}       # model_id -> state_dict (CPU pinned)
        self.access_order = []  # LRU tracking

    def load_to_cpu(self, model_id, model_path):
        """Load model weights into CPU pinned memory."""
        if model_id in self.cache:
            self._touch(model_id)
            return

        # Load from disk
        state_dict = torch.load(model_path, map_location="cpu", weights_only=True)

        # Convert to pinned memory for faster H2D transfer
        pinned_dict = {}
        model_size = 0
        for key, tensor in state_dict.items():
            pinned = torch.empty_like(tensor, pin_memory=True)
            pinned.copy_(tensor)
            pinned_dict[key] = pinned
            model_size += tensor.nbytes

        # Evict LRU models if necessary
        while self.used_memory + model_size > self.max_memory and self.access_order:
            self._evict_lru()

        self.cache[model_id] = pinned_dict
        self.used_memory += model_size
        self.access_order.append(model_id)

    def switch_to_gpu(self, model_id, device="cuda:0"):
        """Transfer cached model from CPU pinned memory to GPU."""
        if model_id not in self.cache:
            raise KeyError(f"Model {model_id} not in CPU cache")

        self._touch(model_id)
        state_dict = self.cache[model_id]

        start = time.perf_counter()

        # Pinned memory H2D transfer is faster than pageable
        # PCIe 5.0 with pinned memory: ~55 GB/s
        gpu_dict = {}
        for key, tensor in state_dict.items():
            gpu_dict[key] = tensor.to(device, non_blocking=True)
        torch.cuda.synchronize()

        elapsed = time.perf_counter() - start
        return gpu_dict, elapsed

    def _touch(self, model_id):
        if model_id in self.access_order:
            self.access_order.remove(model_id)
        self.access_order.append(model_id)

    def _evict_lru(self):
        if not self.access_order:
            return
        victim_id = self.access_order.pop(0)
        victim = self.cache.pop(victim_id)
        for tensor in victim.values():
            self.used_memory -= tensor.nbytes
        del victim

Warm switch latency for Llama 70B FP16:

$$t_{\text{warm}} = \frac{140\text{ GB}}{55\text{ GB/s}} \approx 2.5\text{ seconds}$$

Better than cold, but still 2.5 seconds of downtime per switch.

ModelExpress: Overlapped Transfer

The key optimization: overlap model weight transfer with serving. While the GPU is running the current model, asynchronously transfer the next model’s weights:

class ModelExpressLoader:
    """
    Overlapped model loading: transfer weights while GPU is busy serving.
    Achieves near-zero-downtime model switches.
    """

    def __init__(self, device="cuda:0"):
        self.device = device
        self.current_model = None
        self.next_model_buffer = None
        self.transfer_stream = torch.cuda.Stream()

    def start_preload(self, state_dict_cpu):
        """
        Begin async transfer of next model while current model is serving.
        Call this when you predict you will need to switch soon.
        """
        # Allocate GPU buffer for the new model
        self.next_model_buffer = {}

        with torch.cuda.stream(self.transfer_stream):
            for key, cpu_tensor in state_dict_cpu.items():
                gpu_tensor = torch.empty_like(
                    cpu_tensor, device=self.device
                )
                gpu_tensor.copy_(cpu_tensor, non_blocking=True)
                self.next_model_buffer[key] = gpu_tensor

    def is_preload_complete(self):
        """Check if async transfer is done."""
        return self.transfer_stream.query()

    def complete_switch(self):
        """
        Finalize the model switch. Must be called after preload completes.
        The actual "downtime" is just the model initialization (not transfer).
        """
        # Wait for transfer to finish (should already be done)
        self.transfer_stream.synchronize()

        # Free old model memory
        if self.current_model is not None:
            del self.current_model
            torch.cuda.empty_cache()

        # The new model weights are already on GPU
        self.current_model = self.next_model_buffer
        self.next_model_buffer = None

        # Model initialization (creating nn.Module, loading state_dict)
        # This takes ~200ms regardless of model size
        return self.current_model

With overlapped transfer, the effective switch time is:

$$t_{\text{express}} = t_{\text{init}} + \max(0,\ t_{\text{transfer}} - t_{\text{overlap}})$$

If we start preloading early enough that $t_{\text{overlap}} \geq t_{\text{transfer}}$, the switch time is just $t_{\text{init}} \approx 200\text{ ms}$.
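The formula translates directly into code — a sketch (the function name is mine, not from any ModelExpress API):

```python
def express_switch_time(t_init_s, t_transfer_s, t_overlap_s):
    """Effective downtime of an overlapped model switch.

    t_init_s:     model (re)initialization time, roughly constant (~0.2 s)
    t_transfer_s: full weight transfer time (model size / PCIe bandwidth)
    t_overlap_s:  how long the transfer ran while the old model was serving
    """
    return t_init_s + max(0.0, t_transfer_s - t_overlap_s)

# 70B FP16 over PCIe 5.0: 140 GB / 50 GB/s = 2.8 s of transfer.
# Preload started 3 s before the switch: only init time remains.
print(express_switch_time(0.2, 2.8, 3.0))   # 0.2
# Preload started only 1 s early: 1.8 s of transfer is still exposed.
print(express_switch_time(0.2, 2.8, 1.0))   # ~2.0
```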

Memory Requirement for Overlapped Transfer

During the overlap period, both the current model and the next model’s weights are in GPU memory simultaneously. For Llama 70B FP16, this requires 280 GB of GPU memory for weights alone. On H100 80GB GPUs, this means the technique only works if you are using model parallelism across enough GPUs to have headroom. For a 70B model on 4 GPUs (35 GB weights per GPU), overlapped loading needs 70 GB per GPU — feasible with 80 GB GPUs if KV cache is small.
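A sketch of the resulting per-GPU feasibility check (the 10 GB KV reservation is an arbitrary illustration, not a recommendation):

```python
def overlap_feasible(model_weights_gb, tp_degree, gpu_mem_gb=80.0,
                     kv_reserve_gb=10.0):
    """Can each GPU hold two copies of its weight shard (outgoing model
    plus incoming model) and still keep a KV-cache reservation?"""
    shard_gb = model_weights_gb / tp_degree
    return 2 * shard_gb + kv_reserve_gb <= gpu_mem_gb

# Llama 70B FP16 (140 GB) across 4 GPUs: 2 x 35 GB + 10 GB = 80 GB, just fits.
print(overlap_feasible(140, 4))   # True
# The same model on 2 GPUs: 2 x 70 GB alone already exceeds 80 GB.
print(overlap_feasible(140, 2))   # False
```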

Model Switch Latency by Strategy (Llama 70B FP16)

- Cold switch (NVMe to GPU): 23.3 seconds
- Warm switch (CPU DRAM to GPU): 2.5 seconds
- ModelExpress (overlapped): 0.2 seconds
- Adapter swap (LoRA rank 64): 0.015 seconds
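The cold and warm numbers come straight from the bandwidth formulas earlier in this section (7 GB/s NVMe, 50 GB/s pageable PCIe, 55 GB/s pinned PCIe, 0.5 s init), packaged here for comparison:

```python
def cold_switch_s(model_gb, disk_gbps=7.0, pcie_gbps=50.0, init_s=0.5):
    """NVMe read + CPU-to-GPU transfer + model initialization."""
    return model_gb / disk_gbps + model_gb / pcie_gbps + init_s

def warm_switch_s(model_gb, pinned_pcie_gbps=55.0):
    """Pinned CPU DRAM to GPU transfer only (disk read already done)."""
    return model_gb / pinned_pcie_gbps

# Llama 70B FP16 = 140 GB of weights.
print(round(cold_switch_s(140), 1))   # 23.3
print(round(warm_switch_s(140), 1))   # 2.5
```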

Spatial Sharing: Two Models on One GPU

Instead of switching models over time, run multiple models simultaneously on the same GPU.

Memory Partitioning

The simplest form: allocate fixed memory fractions to each model.

import torch

def partition_gpu_memory(models, gpu_memory_gb=80):
    """
    Partition GPU memory across multiple models.
    Each model gets a fraction proportional to its weight size.
    """
    total_weight_size = sum(m["weight_size_gb"] for m in models)
    kv_cache_fraction = 0.45  # reserve 45% for KV cache
    weight_memory = gpu_memory_gb * (1 - kv_cache_fraction)

    if total_weight_size > weight_memory:
        return None  # models do not fit

    allocations = []
    remaining_kv_memory = gpu_memory_gb * kv_cache_fraction

    for model in models:
        weight_fraction = model["weight_size_gb"] / total_weight_size
        kv_allocation = remaining_kv_memory * weight_fraction

        allocations.append({
            "model_id": model["id"],
            "weight_memory_gb": model["weight_size_gb"],
            "kv_cache_memory_gb": kv_allocation,
            "max_concurrent_requests": int(
                kv_allocation * 1024 / model["kv_per_request_mb"]
            ),
        })

    return allocations

# Example: two models on one H100 80GB
models = [
    {
        "id": "llama-8b-fp8",
        "weight_size_gb": 8,
        "kv_per_request_mb": 2,  # 512 tokens, FP16 KV
    },
    {
        "id": "mistral-7b-fp8",
        "weight_size_gb": 7,
        "kv_per_request_mb": 1.5,
    },
]

allocs = partition_gpu_memory(models, gpu_memory_gb=80)
# llama-8b: 8 GB weights, 19.2 GB KV   -> ~9,800 concurrent requests
# mistral-7b: 7 GB weights, 16.8 GB KV -> ~11,500 concurrent requests

NVIDIA MPS for SM Sharing

CUDA Multi-Process Service (MPS) allows multiple processes to share GPU SMs simultaneously, rather than time-slicing:

# Start the MPS control daemon
nvidia-cuda-mps-control -d

# Optionally cap the SM fraction available to new MPS clients.
# Per-client limits are set via CUDA_MPS_ACTIVE_THREAD_PERCENTAGE in
# each client process's environment (as the launcher below does).
echo "set_default_active_thread_percentage 60" | nvidia-cuda-mps-control
import os
import subprocess

class MPSModelServer:
    """
    Run two models on one GPU using MPS for SM sharing.
    Each model runs in a separate process.
    """

    def __init__(self, gpu_id=0):
        self.gpu_id = gpu_id
        self.processes = {}

    def start_mps(self):
        """Initialize MPS daemon for the GPU."""
        env = os.environ.copy()
        env["CUDA_VISIBLE_DEVICES"] = str(self.gpu_id)
        env["CUDA_MPS_PIPE_DIRECTORY"] = f"/tmp/mps_pipe_{self.gpu_id}"
        env["CUDA_MPS_LOG_DIRECTORY"] = f"/tmp/mps_log_{self.gpu_id}"

        os.makedirs(env["CUDA_MPS_PIPE_DIRECTORY"], exist_ok=True)
        os.makedirs(env["CUDA_MPS_LOG_DIRECTORY"], exist_ok=True)

        subprocess.Popen(
            ["nvidia-cuda-mps-control", "-d"],
            env=env,
        )

    def launch_model(self, model_id, model_path, sm_percentage):
        """Launch a model server process with an SM allocation."""
        env = os.environ.copy()
        env["CUDA_VISIBLE_DEVICES"] = str(self.gpu_id)
        env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_percentage)

        # Each model server is an independent process
        proc = subprocess.Popen(
            [
                "python", "model_server.py",
                "--model", model_path,
                "--model-id", model_id,
            ],
            env=env,
        )
        self.processes[model_id] = proc

Spatial Sharing Limitations

Spatial sharing has fundamental constraints:

1. Memory isolation. Modern MPS (Volta and later) gives each client its own GPU address space, but all clients still draw from the same physical memory pool: an allocation spike in one process can cause out-of-memory failures in its neighbor. Production systems must carefully pre-allocate and cap memory per process.

2. SM interference. Even with MPS, two models share the L2 cache, memory controllers, and HBM bandwidth. A compute-heavy model will not slow the other model’s compute, but both compete for memory bandwidth.

3. Error isolation. A CUDA error (e.g., illegal memory access) in one process brings down the entire GPU context, killing all co-located models.

# Memory bandwidth contention measurement
def measure_bandwidth_contention():
    """
    When two models share a GPU, each gets less memory bandwidth.
    Memory-bound operations (decode) slow down.
    Compute-bound operations (large-batch prefill) are less affected.
    """
    # Bandwidth contention model:
    # Single model: BW_eff = BW_peak * utilization
    # Two models: BW_eff_per_model = BW_peak * utilization / contention_factor
    #
    # Measured contention factors on H100:
    contention = {
        "both_memory_bound": 1.8,   # each gets ~55% of peak BW
        "one_memory_one_compute": 1.2,  # memory-bound gets ~83%
        "both_compute_bound": 1.05,  # minimal contention
    }
    return contention
📊

Spatial Sharing Impact on Per-Model Performance (H100 80GB)

| Configuration | Model A Throughput | Model B Throughput | Total Throughput | vs Dedicated |
|---|---|---|---|---|
| Dedicated (A only) | 3000 tok/s | — | 3000 tok/s | 1.00x |
| Dedicated (B only) | — | 3200 tok/s | 3200 tok/s | 1.00x |
| Spatial 50/50 (decode) | 1700 tok/s | 1800 tok/s | 3500 tok/s | 0.56x each |
| Spatial 50/50 (prefill) | 2600 tok/s | 2800 tok/s | 5400 tok/s | 0.87x each |
| Spatial 70/30 (A heavy) | 2400 tok/s | 900 tok/s | 3300 tok/s | varies |

Note: Both models are 8B FP8. Decode is memory-bandwidth-bound, so sharing hurts more. Prefill is compute-bound, so sharing has less impact. Total throughput exceeds single-model because both models together utilize more of the GPU's resources.
⚠️ Spatial Sharing is Fragile

Spatial sharing works well in controlled benchmarks but is difficult to operate in production. The lack of memory isolation means any model can OOM and crash its neighbor. The performance interference is workload-dependent and hard to predict. Most production systems prefer temporal sharing (model switching) or adapter pools over spatial sharing. MPS-based spatial sharing is primarily used when two models have complementary access patterns (one compute-bound, one memory-bound) and the operator accepts the operational risk.

Adapter Pools: Base Model + N LoRA Adapters

The most memory-efficient multi-model strategy: load one base model and swap lightweight LoRA adapters for each customer or task.

LoRA Memory Math

A LoRA adapter modifies a weight matrix $W \in \mathbb{R}^{d \times k}$ with a low-rank update:

$$W' = W + \alpha \cdot B \cdot A$$

where $A \in \mathbb{R}^{r \times k}$ and $B \in \mathbb{R}^{d \times r}$, with rank $r \ll \min(d, k)$.

Memory per adapter:

$$\text{adapter size} = (d \times r + r \times k) \times \text{dtype\_size} \times \text{num\_adapted\_layers}$$

For Llama 70B with rank 64, adapting all attention projections (Q, K, V, O per layer, 80 layers):

  - Per layer: $(8192 \times 64 + 64 \times 8192) \times 2 \times 4 = 8.4\text{ MB}$ (4 projections, FP16)
  - Total: $8.4 \times 80 = 671\text{ MB}$

Compare to the base model: 140 GB. The adapter is 0.48% of the base model size. You can store 100 adapters in 67 GB — less than one copy of the base model.
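The same arithmetic as a function (a sketch; like the estimate above, it treats all four projections as full 8192-dimensional, ignoring the smaller K/V projections that GQA actually uses):

```python
def lora_adapter_bytes(d, k, rank, n_proj, n_layers, dtype_bytes=2):
    """Size of one LoRA adapter: a (d x r) B matrix and an (r x k) A
    matrix per adapted projection, per layer."""
    per_proj = (d * rank + rank * k) * dtype_bytes
    return per_proj * n_proj * n_layers

# Llama 70B: hidden dim 8192, rank 64, Q/K/V/O on all 80 layers, FP16.
size = lora_adapter_bytes(8192, 8192, 64, 4, 80)
print(size / 1e6)           # ~671 MB
print(size / 140e9 * 100)   # ~0.48% of the 140 GB base model
```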

S-LoRA: Scalable LoRA Serving

S-LoRA (Sheng et al., 2023) addresses the key challenge of serving many LoRA adapters concurrently: different requests in the same batch may require different adapters.

class SLoRAManager:
    """
    S-LoRA adapter pool manager.
    Handles adapter loading, batched LoRA computation, and adapter scheduling.
    """

    def __init__(self, base_model, max_adapters_gpu=20, max_adapters_cpu=200):
        self.base_model = base_model
        self.hidden_dim = base_model.config.hidden_size
        self.num_layers = base_model.config.num_hidden_layers

        # Adapter storage
        self.gpu_adapters = {}   # adapter_id -> {layer_id -> (A, B)} on GPU
        self.cpu_adapters = {}   # adapter_id -> {layer_id -> (A, B)} on CPU
        self.adapter_metadata = {}  # adapter_id -> {rank, alpha, size_bytes}

        # LRU for GPU adapter cache
        self.gpu_access_order = []
        self.max_gpu_adapters = max_adapters_gpu
        self.max_cpu_adapters = max_adapters_cpu

    def load_adapter(self, adapter_id, adapter_path):
        """Load a LoRA adapter from disk into CPU cache."""
        adapter_weights = torch.load(adapter_path, weights_only=True)

        # Parse LoRA weights into per-layer (A, B) pairs
        parsed = {}
        rank = None
        for key, tensor in adapter_weights.items():
            # Expected format: layers.{i}.{proj}.lora_{a|b}.weight
            parts = key.split(".")
            layer_idx = int(parts[1])
            proj_name = parts[2]
            lora_part = parts[3]  # "lora_a" or "lora_b"

            if layer_idx not in parsed:
                parsed[layer_idx] = {}
            if proj_name not in parsed[layer_idx]:
                parsed[layer_idx][proj_name] = {}

            parsed[layer_idx][proj_name][lora_part] = tensor

            if lora_part == "lora_a" and rank is None:
                rank = tensor.shape[0]

        self.cpu_adapters[adapter_id] = parsed
        total_size = sum(t.nbytes for t in adapter_weights.values())
        self.adapter_metadata[adapter_id] = {
            "rank": rank,
            "size_bytes": total_size,
        }

    def ensure_on_gpu(self, adapter_id):
        """Ensure adapter is on GPU, loading from CPU if necessary."""
        if adapter_id in self.gpu_adapters:
            self._touch_gpu(adapter_id)
            return

        # Evict LRU adapter from GPU if at capacity
        while len(self.gpu_adapters) >= self.max_gpu_adapters:
            self._evict_gpu_lru()

        # Transfer from CPU to GPU
        cpu_adapter = self.cpu_adapters[adapter_id]
        gpu_adapter = {}
        for layer_idx, projs in cpu_adapter.items():
            gpu_adapter[layer_idx] = {}
            for proj_name, parts in projs.items():
                gpu_adapter[layer_idx][proj_name] = {
                    k: v.cuda(non_blocking=True)
                    for k, v in parts.items()
                }

        self.gpu_adapters[adapter_id] = gpu_adapter
        self.gpu_access_order.append(adapter_id)

    def _touch_gpu(self, adapter_id):
        if adapter_id in self.gpu_access_order:
            self.gpu_access_order.remove(adapter_id)
        self.gpu_access_order.append(adapter_id)

    def _evict_gpu_lru(self):
        if not self.gpu_access_order:
            return
        victim = self.gpu_access_order.pop(0)
        del self.gpu_adapters[victim]
        torch.cuda.empty_cache()

def batched_lora_forward(
    hidden_states,
    base_weight,
    adapter_assignments,
    adapter_pool,
    layer_idx,
    proj_name,
):
    """
    Forward pass with multiple LoRA adapters in one batch.

    Args:
        hidden_states: [total_tokens, hidden_dim]
        base_weight: [hidden_dim, output_dim]
        adapter_assignments: list of (start_idx, end_idx, adapter_id)
            Each entry says tokens[start:end] use adapter_id
        adapter_pool: SLoRAManager with adapters on GPU
        layer_idx: which transformer layer
        proj_name: which projection (q, k, v, o)
    """
    # Step 1: Base model computation (same for all tokens)
    base_output = hidden_states @ base_weight  # [total_tokens, output_dim]

    # Step 2: LoRA delta computation (per-adapter)
    # Option A: Loop over adapters (simple but sequential)
    for start_idx, end_idx, adapter_id in adapter_assignments:
        if adapter_id is None:
            continue  # no adapter for these tokens

        adapter = adapter_pool.gpu_adapters[adapter_id][layer_idx][proj_name]
        A = adapter["lora_a"]  # [rank, hidden_dim]
        B = adapter["lora_b"]  # [output_dim, rank]

        # Tokens for this adapter
        x = hidden_states[start_idx:end_idx]  # [n_tokens, hidden_dim]

        # LoRA computation: x @ A^T @ B^T
        delta = x @ A.T @ B.T  # [n_tokens, output_dim]
        base_output[start_idx:end_idx] += delta

    return base_output

SGMV: Custom Kernels for Batched LoRA

The per-adapter loop in batched_lora_forward is inefficient because each LoRA computation is a small GEMM that underutilizes the GPU. S-LoRA uses a custom CUDA kernel called SGMV (Segmented Gather Matrix-Vector) to batch all LoRA computations into one kernel:

def sgmv_batched_lora(
    hidden_states,
    lora_a_weights,   # list of A matrices, one per unique adapter
    lora_b_weights,   # list of B matrices, one per unique adapter
    segment_starts,   # start index of each adapter segment
    segment_lengths,  # number of tokens per segment
    adapter_indices,  # which adapter each segment uses
):
    """
    SGMV kernel: compute all LoRA deltas in one kernel launch.

    Instead of N separate small GEMMs (one per adapter), we launch one
    kernel that handles all adapters. The kernel:
    1. Each thread block processes one segment
    2. Loads the appropriate A and B matrices based on adapter_index
    3. Computes the LoRA delta for that segment
    4. Writes result to the correct output location

    This is ~5x faster than the loop approach for 10+ concurrent adapters.
    """
    # In practice, this calls a custom CUDA kernel:
    # punica.ops.sgmv(hidden_states, lora_a_weights, lora_b_weights,
    #                  segment_starts, segment_lengths, adapter_indices)
    pass
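
To pin down the kernel's semantics without CUDA, here is a NumPy reference for the segmented computation (reference semantics only; the speedup comes from fusing this loop into a single GPU kernel, as Punica's sgmv does):

```python
import numpy as np

def sgmv_reference(x, lora_a, lora_b, seg_starts, seg_lens, adapter_idx):
    """Reference semantics of SGMV: per-segment low-rank deltas.

    x:          [total_tokens, hidden]
    lora_a[i]:  [rank, hidden] for unique adapter i
    lora_b[i]:  [out_dim, rank]
    Segment s covers x[seg_starts[s] : seg_starts[s] + seg_lens[s]]
    and uses adapter adapter_idx[s].
    """
    out_dim = lora_b[0].shape[0]
    delta = np.zeros((x.shape[0], out_dim), dtype=x.dtype)
    for start, length, a in zip(seg_starts, seg_lens, adapter_idx):
        seg = x[start:start + length]                  # gather the segment
        delta[start:start + length] = seg @ lora_a[a].T @ lora_b[a].T
    return delta

# Two segments of 3 tokens each, using two different rank-2 adapters.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))
A = [rng.standard_normal((2, 8)) for _ in range(2)]   # rank 2, hidden 8
B = [rng.standard_normal((4, 2)) for _ in range(2)]   # out_dim 4
delta = sgmv_reference(x, A, B, [0, 3], [3, 3], [0, 1])
```

A fused kernel does exactly this gather-GEMM-scatter, but with one launch for all segments instead of one small GEMM per adapter.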
📊

LoRA Serving Performance: Loop vs SGMV (H100, Llama 70B base, 20 adapters)

| Method | Throughput (tok/s) | LoRA Overhead vs Base | Max Concurrent Adapters |
|---|---|---|---|
| Base model only (no LoRA) | 3000 | 0% | 0 |
| Naive loop (sequential GEMMs) | 2100 | 30% | 5-10 |
| SGMV batched kernel | 2750 | 8% | 50-100 |
| Merged adapters (offline) | 2950 | 2% | 1 (premerged) |

Note: Merged adapters pre-compute W' = W + alpha * B * A offline. Zero runtime overhead but only works for one adapter at a time. SGMV handles many concurrent adapters with modest overhead.

Adapter Scheduling

With hundreds of adapters and requests arriving in real time, the scheduler must decide which adapters to keep on GPU:

class AdapterScheduler:
    """
    Schedule which LoRA adapters to keep on GPU based on traffic patterns.
    """

    def __init__(self, adapter_pool, max_gpu_adapters=20):
        self.adapter_pool = adapter_pool
        self.max_gpu_adapters = max_gpu_adapters
        self.request_counts = {}  # adapter_id -> recent request count
        self.window_size = 100    # count requests in last 100 iterations

    def update_traffic(self, batch_adapter_ids):
        """Update traffic statistics from the current batch."""
        for adapter_id in batch_adapter_ids:
            self.request_counts[adapter_id] = (
                self.request_counts.get(adapter_id, 0) + 1
            )

    def select_gpu_adapters(self, pending_requests):
        """
        Decide which adapters should be on GPU for the next iteration.
        Strategy: prioritize adapters needed by pending requests,
        then keep popular adapters.
        """
        # Must-have: adapters needed by currently running requests
        needed_now = set()
        for req in pending_requests:
            if req.adapter_id is not None:
                needed_now.add(req.adapter_id)

        # Nice-to-have: popular adapters likely to be needed soon
        popular = sorted(
            self.request_counts.keys(),
            key=lambda a: self.request_counts[a],
            reverse=True,
        )

        selected = list(needed_now)
        remaining_slots = self.max_gpu_adapters - len(selected)

        for adapter_id in popular:
            if remaining_slots <= 0:
                break
            if adapter_id not in needed_now:
                selected.append(adapter_id)
                remaining_slots -= 1

        # Ensure selected adapters are on GPU
        for adapter_id in selected:
            self.adapter_pool.ensure_on_gpu(adapter_id)

        return selected
ℹ️ Adapter Prefetching

If the system can predict which adapters will be needed (e.g., based on user session data or API key mapping), it can prefetch adapters to GPU before requests arrive. This eliminates the 15ms adapter swap latency entirely. In practice, most adapter accesses follow a Zipf distribution — a small number of popular adapters handle most traffic, and they stay hot in the GPU cache permanently.
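The Zipf claim can be quantified: if adapter popularity follows a Zipf distribution with exponent s, the expected hit rate of a GPU cache that pins the k most popular of N adapters is the Zipf mass of the top k. A sketch:

```python
def zipf_cache_hit_rate(n_adapters, gpu_slots, s=1.0):
    """Expected GPU-cache hit rate when the gpu_slots most popular of
    n_adapters stay resident and request popularity is Zipf(s)."""
    weights = [1.0 / (rank ** s) for rank in range(1, n_adapters + 1)]
    return sum(weights[:gpu_slots]) / sum(weights)

# 500 adapters, 20 GPU slots: the hottest 4% of adapters serve over
# half the traffic without any CPU-to-GPU swap.
print(round(zipf_cache_hit_rate(500, 20), 2))   # 0.53
```

This is why a modest GPU adapter cache (tens of slots) absorbs most traffic even with hundreds of registered adapters.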

Traffic-Based Routing

With multiple GPUs each potentially hosting different models or adapters, the router must send each request to a GPU that already has the right model loaded.

Affinity-Based Routing

import hashlib
from collections import defaultdict

class ModelAffinityRouter:
    """
    Route requests to GPUs based on model affinity.
    Goal: minimize model switches by sending requests for the same
    model to the same GPU.
    """

    def __init__(self, gpu_ids):
        self.gpu_ids = gpu_ids
        self.gpu_models = {}     # gpu_id -> currently loaded model_id
        self.gpu_adapters = defaultdict(set)  # gpu_id -> set of loaded adapter_ids
        self.gpu_queue_depth = defaultdict(int)  # gpu_id -> pending requests
        self.model_to_gpus = defaultdict(set)  # model_id -> set of gpu_ids

    def route_request(self, model_id, adapter_id=None):
        """
        Select the best GPU for this request.

        Priority:
        1. GPU with correct model AND adapter already loaded (zero switch cost)
        2. GPU with correct model loaded (adapter swap: ~15ms)
        3. GPU with model in warm cache (warm switch: ~2.5s)
        4. Least loaded GPU (cold switch: ~23s)
        """
        # Priority 1: exact match (model + adapter)
        if adapter_id:
            exact_matches = [
                gpu_id for gpu_id in self.gpu_ids
                if (self.gpu_models.get(gpu_id) == model_id
                    and adapter_id in self.gpu_adapters[gpu_id])
            ]
            if exact_matches:
                return self._select_least_loaded(exact_matches)

        # Priority 2: model match (adapter not loaded but model is)
        model_matches = [
            gpu_id for gpu_id in self.gpu_ids
            if self.gpu_models.get(gpu_id) == model_id
        ]
        if model_matches:
            return self._select_least_loaded(model_matches)

        # Priority 3: any GPU with warm cache for this model
        warm_matches = [
            gpu_id for gpu_id in self.gpu_ids
            if self._has_warm_cache(gpu_id, model_id)
        ]
        if warm_matches:
            return self._select_least_loaded(warm_matches)

        # Priority 4: least loaded GPU (will cold switch)
        return self._select_least_loaded(self.gpu_ids)

    def _select_least_loaded(self, candidates):
        """Among candidate GPUs, pick the one with fewest pending requests."""
        return min(candidates, key=lambda g: self.gpu_queue_depth[g])

    def _has_warm_cache(self, gpu_id, model_id):
        """Check if GPU has model weights in CPU memory (warm cache)."""
        # Implementation depends on warm cache integration
        return False  # placeholder

    def update_gpu_state(self, gpu_id, model_id, adapter_ids):
        """Called by GPU workers to report their current state."""
        old_model = self.gpu_models.get(gpu_id)
        if old_model and old_model != model_id:
            self.model_to_gpus[old_model].discard(gpu_id)

        self.gpu_models[gpu_id] = model_id
        self.gpu_adapters[gpu_id] = set(adapter_ids)
        self.model_to_gpus[model_id].add(gpu_id)

Consistent Hashing for Sticky Routing

For stateless routing (where the router does not track GPU state), consistent hashing provides model-GPU affinity:

import bisect
import hashlib

class ConsistentHashRouter:
    """
    Consistent hashing for model-to-GPU routing.
    Models with the same ID always route to the same GPU subset.
    Adding/removing GPUs minimally disrupts routing.
    """

    def __init__(self, gpu_ids, virtual_nodes=150):
        self.ring = []  # sorted list of (hash, gpu_id)
        self.gpu_ids = set(gpu_ids)

        for gpu_id in gpu_ids:
            for i in range(virtual_nodes):
                h = self._hash(f"{gpu_id}:{i}")
                self.ring.append((h, gpu_id))

        self.ring.sort(key=lambda x: x[0])
        self.ring_hashes = [h for h, _ in self.ring]

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def route(self, model_id, num_replicas=1):
        """
        Find GPUs for a model.
        Returns up to num_replicas distinct GPU IDs for load balancing.
        """
        # Clamp so the loop below cannot ask for more distinct GPUs than exist
        num_replicas = min(num_replicas, len(self.gpu_ids))
        h = self._hash(model_id)
        idx = bisect.bisect_left(self.ring_hashes, h) % len(self.ring)

        selected = []
        seen = set()
        while len(selected) < num_replicas:
            _, gpu_id = self.ring[idx % len(self.ring)]
            if gpu_id not in seen:
                selected.append(gpu_id)
                seen.add(gpu_id)
            idx += 1

        return selected
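To see the "minimally disrupts routing" property concretely, here is a standalone sketch (a miniature re-declaration of the same ring construction, rather than reusing the class above) that measures how many model keys remap when a fifth GPU joins a four-GPU ring:

```python
import bisect
import hashlib

def build_ring(nodes, vnodes=150):
    """Build a sorted (hash, node) ring with virtual nodes per physical node."""
    ring = []
    for node in nodes:
        for i in range(vnodes):
            h = int(hashlib.md5(f"{node}:{i}".encode()).hexdigest(), 16)
            ring.append((h, node))
    ring.sort()
    return ring

def lookup(ring, key):
    """Map a key to the first node clockwise from its hash position."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    hashes = [x[0] for x in ring]
    idx = bisect.bisect_left(hashes, h) % len(ring)
    return ring[idx][1]

models = [f"model-{i}" for i in range(1000)]
ring4 = build_ring(["gpu0", "gpu1", "gpu2", "gpu3"])
ring5 = build_ring(["gpu0", "gpu1", "gpu2", "gpu3", "gpu4"])
moved = sum(lookup(ring4, m) != lookup(ring5, m) for m in models)
print(f"remapped: {moved / len(models):.0%}")  # expected around 1/5 of keys
```

Only the keys that land on the new GPU's ring segments move; with naive modulo hashing (`hash(model) % num_gpus`), roughly 80% of keys would remap instead.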

Request Routing Hit Rate (model already loaded on target GPU)

- Random routing: 12%
- Round-robin: 15%
- Consistent hashing: 72%
- Affinity-based (stateful): 94%
- Affinity + predictive preload: 98%

Implementation: Model Registry with LRU Cache and Adapter Pool

Here is a complete model registry that ties together model switching, adapter management, and routing:

import time
import threading
from collections import OrderedDict
from dataclasses import dataclass, field
from enum import Enum

import torch

class ModelState(Enum):
    ON_GPU = "on_gpu"
    IN_CPU_CACHE = "in_cpu_cache"
    ON_DISK = "on_disk"
    LOADING = "loading"

@dataclass
class ModelEntry:
    model_id: str
    model_path: str
    weight_size_bytes: int
    dtype: str = "float16"

    # Runtime state
    state: ModelState = ModelState.ON_DISK
    gpu_id: int = -1
    last_access_time: float = 0.0
    request_count: int = 0
    cpu_state_dict: dict = field(default=None, repr=False)
    gpu_state_dict: dict = field(default=None, repr=False)

@dataclass
class AdapterEntry:
    adapter_id: str
    base_model_id: str
    adapter_path: str
    rank: int
    size_bytes: int

    # Runtime
    on_gpu: bool = False
    gpu_id: int = -1
    cpu_weights: dict = field(default=None, repr=False)
    gpu_weights: dict = field(default=None, repr=False)
    last_access_time: float = 0.0
    request_count: int = 0

class ModelRegistry:
    """
    Central registry for multi-model serving.
    Manages model lifecycle, adapter pools, and GPU allocation.
    """

    def __init__(
        self,
        gpu_ids,
        gpu_memory_gb=80,
        cpu_cache_gb=500,
        max_adapters_per_gpu=20,
    ):
        self.gpu_ids = gpu_ids
        self.gpu_memory_gb = gpu_memory_gb
        self.cpu_cache_gb = cpu_cache_gb
        self.max_adapters_per_gpu = max_adapters_per_gpu

        # Model registry
        self.models = {}  # model_id -> ModelEntry

        # Adapter registry
        self.adapters = {}  # adapter_id -> AdapterEntry

        # GPU state
        self.gpu_loaded_model = {}    # gpu_id -> model_id
        self.gpu_loaded_adapters = {  # gpu_id -> OrderedDict (LRU)
            gpu_id: OrderedDict() for gpu_id in gpu_ids
        }
        self.gpu_memory_used = {gpu_id: 0 for gpu_id in gpu_ids}

        # CPU LRU cache
        self.cpu_cache = OrderedDict()  # model_id -> state_dict
        self.cpu_cache_used = 0

        # Lock for thread safety
        self.lock = threading.Lock()

        # Metrics
        self.switch_count = 0
        self.cache_hits = 0
        self.cache_misses = 0

    def register_model(self, model_id, model_path, weight_size_gb, dtype="float16"):
        """Register a model in the catalog."""
        self.models[model_id] = ModelEntry(
            model_id=model_id,
            model_path=model_path,
            weight_size_bytes=int(weight_size_gb * 1024**3),
            dtype=dtype,
        )

    def register_adapter(self, adapter_id, base_model_id, adapter_path, rank):
        """Register a LoRA adapter."""
        # Estimate adapter size
        base = self.models[base_model_id]
        # Rough estimate: 2 * hidden_dim * rank * num_layers * num_projections * dtype_size
        hidden_dim = 8192 if "70b" in base_model_id.lower() else 4096
        num_layers = 80 if "70b" in base_model_id.lower() else 32
        num_projections = 4  # Q, K, V, O
        dtype_size = 2 if base.dtype == "float16" else 1
        size = 2 * hidden_dim * rank * num_layers * num_projections * dtype_size

        self.adapters[adapter_id] = AdapterEntry(
            adapter_id=adapter_id,
            base_model_id=base_model_id,
            adapter_path=adapter_path,
            rank=rank,
            size_bytes=size,
        )

    def get_model_for_request(self, model_id, adapter_id, gpu_id):
        """
        Ensure model and adapter are loaded on the specified GPU.
        Returns estimated switch latency.
        """
        with self.lock:
            switch_latency_ms = 0

            # Step 1: Ensure base model is loaded
            if self.gpu_loaded_model.get(gpu_id) != model_id:
                switch_latency_ms = self._switch_model(model_id, gpu_id)

            # Step 2: Ensure adapter is loaded (if requested)
            if adapter_id:
                adapter_latency = self._load_adapter(adapter_id, gpu_id)
                switch_latency_ms += adapter_latency

            # Update metrics
            model = self.models[model_id]
            model.last_access_time = time.time()
            model.request_count += 1

            if adapter_id:
                adapter = self.adapters[adapter_id]
                adapter.last_access_time = time.time()
                adapter.request_count += 1

            return switch_latency_ms

    def _switch_model(self, model_id, gpu_id):
        """Switch the model on a GPU. Returns latency in ms."""
        model = self.models[model_id]
        old_model_id = self.gpu_loaded_model.get(gpu_id)

        # Save old model to CPU cache if present
        if old_model_id:
            self._save_to_cpu_cache(old_model_id, gpu_id)

        # Check CPU cache for new model
        if model_id in self.cpu_cache:
            # Warm switch
            self.cache_hits += 1
            state_dict = self.cpu_cache[model_id]
            latency_ms = self._transfer_cpu_to_gpu(state_dict, gpu_id)
            self.gpu_loaded_model[gpu_id] = model_id
            model.state = ModelState.ON_GPU
            model.gpu_id = gpu_id
            self.switch_count += 1
            return latency_ms
        else:
            # Cold switch
            self.cache_misses += 1
            state_dict = self._load_from_disk(model.model_path)
            latency_ms = self._transfer_cpu_to_gpu(state_dict, gpu_id)

            # Also cache in CPU memory
            self._add_to_cpu_cache(model_id, state_dict)

            self.gpu_loaded_model[gpu_id] = model_id
            model.state = ModelState.ON_GPU
            model.gpu_id = gpu_id
            self.switch_count += 1
            return latency_ms + self._estimate_disk_load_ms(model.weight_size_bytes)

    def _load_adapter(self, adapter_id, gpu_id):
        """Load an adapter onto GPU. Returns latency in ms."""
        adapter_cache = self.gpu_loaded_adapters[gpu_id]

        if adapter_id in adapter_cache:
            # Already loaded, move to end (most recently used)
            adapter_cache.move_to_end(adapter_id)
            return 0.0  # no latency

        # Need to load adapter
        adapter = self.adapters[adapter_id]

        # Evict LRU adapter if at capacity
        while len(adapter_cache) >= self.max_adapters_per_gpu:
            evicted_id, _ = adapter_cache.popitem(last=False)
            evicted = self.adapters[evicted_id]
            evicted.on_gpu = False
            evicted.gpu_weights = None

        # Load adapter weights
        weights = torch.load(adapter.adapter_path, weights_only=True)
        gpu_weights = {k: v.cuda(gpu_id) for k, v in weights.items()}

        adapter_cache[adapter_id] = gpu_weights
        adapter.on_gpu = True
        adapter.gpu_id = gpu_id
        adapter.gpu_weights = gpu_weights

        # Adapter transfer: typically 0.4-1.5 GB at ~50 GB/s
        latency_ms = (adapter.size_bytes / (50 * 1024**3)) * 1000
        return latency_ms

    def _save_to_cpu_cache(self, model_id, gpu_id):
        """Save GPU model weights to CPU cache."""
        # In real implementation, copy weights from GPU to pinned CPU memory
        # For now, just track the cache state
        model = self.models[model_id]
        model.state = ModelState.IN_CPU_CACHE

    def _add_to_cpu_cache(self, model_id, state_dict):
        """Add model to CPU LRU cache, evicting if necessary."""
        model = self.models[model_id]
        size = model.weight_size_bytes

        # Evict LRU entries until space is available
        while (
            self.cpu_cache_used + size > self.cpu_cache_gb * 1024**3
            and self.cpu_cache
        ):
            evicted_id, evicted_dict = self.cpu_cache.popitem(last=False)
            evicted = self.models[evicted_id]
            self.cpu_cache_used -= evicted.weight_size_bytes
            evicted.state = ModelState.ON_DISK
            del evicted_dict

        self.cpu_cache[model_id] = state_dict
        self.cpu_cache_used += size
        model.state = ModelState.IN_CPU_CACHE

    def _load_from_disk(self, model_path):
        """Load model weights from disk."""
        return torch.load(model_path, map_location="cpu", weights_only=True)

    def _transfer_cpu_to_gpu(self, state_dict, gpu_id):
        """Transfer weights from CPU to GPU. Returns estimated latency in ms."""
        total_bytes = sum(t.nbytes for t in state_dict.values())
        # PCIe 5.0 with pinned memory: ~55 GB/s
        return (total_bytes / (55 * 1024**3)) * 1000

    def _estimate_disk_load_ms(self, size_bytes):
        """Estimate disk load time."""
        # NVMe Gen4 x4: ~7 GB/s
        return (size_bytes / (7 * 1024**3)) * 1000

    def get_stats(self):
        """Return registry statistics."""
        return {
            "total_models": len(self.models),
            "total_adapters": len(self.adapters),
            "models_on_gpu": sum(
                1 for m in self.models.values() if m.state == ModelState.ON_GPU
            ),
            "models_in_cpu_cache": sum(
                1 for m in self.models.values() if m.state == ModelState.IN_CPU_CACHE
            ),
            "adapters_on_gpu": sum(
                len(adapters) for adapters in self.gpu_loaded_adapters.values()
            ),
            "model_switches": self.switch_count,
            "cache_hit_rate": (
                self.cache_hits / (self.cache_hits + self.cache_misses)
                if (self.cache_hits + self.cache_misses) > 0
                else 0.0
            ),
            "gpu_utilization": {
                gpu_id: self.gpu_loaded_model.get(gpu_id, "empty")
                for gpu_id in self.gpu_ids
            },
        }

# Usage example
def setup_multi_model_cluster():
    registry = ModelRegistry(
        gpu_ids=list(range(8)),
        gpu_memory_gb=80,
        cpu_cache_gb=500,
        max_adapters_per_gpu=20,
    )

    # Register base models
    registry.register_model("llama-8b", "/models/llama-8b", weight_size_gb=16)
    registry.register_model("llama-70b", "/models/llama-70b", weight_size_gb=140)
    registry.register_model("mistral-7b", "/models/mistral-7b", weight_size_gb=14)

    # Register adapters
    for i in range(100):
        registry.register_adapter(
            adapter_id=f"customer_{i}_adapter",
            base_model_id="llama-70b",
            adapter_path=f"/adapters/customer_{i}.pt",
            rank=64,
        )

    # Simulate request routing
    requests = [
        ("llama-70b", "customer_0_adapter", 0),
        ("llama-70b", "customer_1_adapter", 0),
        ("llama-8b", None, 1),
        ("llama-70b", "customer_0_adapter", 2),
        ("mistral-7b", None, 3),
    ]

    for model_id, adapter_id, gpu_id in requests:
        latency = registry.get_model_for_request(model_id, adapter_id, gpu_id)
        print(f"Model={model_id}, Adapter={adapter_id}, GPU={gpu_id}, "
              f"Switch latency={latency:.1f}ms")

    print(registry.get_stats())

Cost Optimization: When to Use Each Strategy

The choice between temporal sharing, spatial sharing, and adapter pools depends on the traffic pattern:

📊

Multi-Model Strategy Decision Matrix

Strategy | Best When | Switch Cost | Memory Efficiency | Operational Complexity
Temporal (cold) | Rare model switches, batch jobs | 10-25s | Excellent (1 model per GPU) | Low
Temporal (warm) | Moderate switch frequency | 2-3s | Good (CPU memory needed) | Medium
Temporal (overlapped) | Predictable traffic patterns | 200ms | Poor (2x GPU memory during switch) | High
Spatial (MPS) | Two complementary models | 0 | Moderate (memory split) | High
Adapter pool | Many variants of one base | 15ms | Excellent (adapters are tiny) | Medium
Note: Adapter pools are the clear winner when the model variants are LoRA fine-tunes of the same base. Temporal sharing is necessary when base models differ. Spatial sharing is a niche optimization for specific workloads.

Cost Model

The total cost of serving N models on G GPUs:

\text{Cost} = G \times \text{cost\_per\_GPU\_hour} + \text{switch\_penalty}

where the switch penalty is:

\text{switch\_penalty} = \sum_{\text{switches}} t_{\text{switch}} \times \text{lost\_throughput}

For a cluster doing 100 model switches per hour with warm switching (2.5s each):

\text{switch\_penalty} = 100 \times 2.5 \text{ s} \times 3000 \text{ tok/s} = 750{,}000 \text{ tokens lost/hour}

At $0.003 per 1K tokens, that is $2.25/hour in lost revenue per GPU that switches. For a 64-GPU cluster, that adds up to $144/hour in switch overhead. Reducing switch time from 2.5s (warm) to 0.2s (overlapped) cuts this to about $11.50/hour.

With adapter pools, 100 “switches” per hour cost:

100 \times 0.015 \text{ s} \times 3000 \text{ tok/s} = 4{,}500 \text{ tokens lost/hour} = \$0.014/\text{hour}
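The arithmetic above can be wrapped in a small helper; the function name and default parameters are illustrative, not from any serving framework:

```python
def switch_penalty_per_hour(switches_per_hour, switch_time_s,
                            throughput_tok_s=3000, price_per_1k_tok=0.003):
    """Dollars of lost generation per hour, per GPU, due to model switches."""
    lost_tokens = switches_per_hour * switch_time_s * throughput_tok_s
    return lost_tokens / 1000 * price_per_1k_tok

print(switch_penalty_per_hour(100, 2.5))    # warm switches: ~$2.25/hour
print(switch_penalty_per_hour(100, 0.015))  # adapter swaps: ~$0.014/hour
print(switch_penalty_per_hour(100, 23))     # cold switches: ~$20.70/hour
```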

Hourly Cost of Model Switching (100 switches/hour, 3000 tok/s per GPU)

- Cold switch (23s per switch): $20.70/hour
- Warm switch (2.5s per switch): $2.25/hour
- Overlapped switch (0.2s per switch): $0.18/hour
- Adapter swap (15ms per switch): $0.014/hour
💡 Architecture Decision

If your multi-model requirements can be served by LoRA adapters (same base model, task-specific customization), always prefer adapter pools. The memory overhead is negligible (0.5% per adapter), switch time is 15ms, and SGMV kernels make concurrent adapter serving efficient. Reserve temporal model switching for fundamentally different base models (e.g., switching between 8B and 70B, or between Llama and Mistral).
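The 0.5% figure can be sanity-checked with the same sizing estimate register_adapter uses above (rank-64 LoRA on the Q/K/V/O projections of an 80-layer, 8192-hidden FP16 base, against ~140 GB of base weights; the 140 GB figure is the llama-70b size registered earlier):

```python
hidden_dim, num_layers, rank = 8192, 80, 64
num_projections, dtype_bytes = 4, 2  # Q, K, V, O projections in FP16
# Each projection carries two low-rank factors: A (rank x hidden) and B (hidden x rank)
adapter_bytes = 2 * hidden_dim * rank * num_layers * num_projections * dtype_bytes
base_bytes = 140 * 1024**3           # 70B base model weights in FP16

print(f"adapter: {adapter_bytes / 1024**3:.2f} GB "
      f"({adapter_bytes / base_bytes:.2%} of base weights)")
```

A rank-64 adapter comes out to about 0.63 GB, under half a percent of the base model, which is why dozens of adapters fit alongside one set of base weights.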

Monitoring and Autoscaling

A multi-model cluster needs monitoring to detect when to scale or rebalance:

import time
from collections import defaultdict

class MultiModelMonitor:
    """Monitor multi-model serving cluster health."""

    def __init__(self, registry):
        self.registry = registry
        self.model_latencies = defaultdict(list)   # model_id -> [latencies]
        self.switch_events = []                     # (timestamp, gpu_id, from, to)
        self.adapter_swap_events = []
        self.queue_depths = defaultdict(list)       # model_id -> [queue_depths]

    def record_request(self, model_id, adapter_id, latency_ms, gpu_id):
        """Record a completed request."""
        self.model_latencies[model_id].append(latency_ms)

    def record_switch(self, gpu_id, from_model, to_model, switch_time_ms):
        """Record a model switch event."""
        self.switch_events.append({
            "timestamp": time.time(),
            "gpu_id": gpu_id,
            "from_model": from_model,
            "to_model": to_model,
            "switch_time_ms": switch_time_ms,
        })

    def get_scaling_recommendations(self):
        """
        Analyze metrics and recommend scaling actions.
        """
        recommendations = []

        # Check switch frequency per GPU
        now = time.time()
        recent_switches = [
            s for s in self.switch_events
            if now - s["timestamp"] < 3600  # last hour
        ]

        switch_rate_per_gpu = defaultdict(int)
        for s in recent_switches:
            switch_rate_per_gpu[s["gpu_id"]] += 1

        for gpu_id, count in switch_rate_per_gpu.items():
            if count > 50:  # more than 50 switches/hour
                model = self.registry.gpu_loaded_model.get(gpu_id)
                recommendations.append({
                    "action": "add_dedicated_gpu",
                    "reason": f"GPU {gpu_id} switching {count}x/hour",
                    "target_model": model,
                    "estimated_savings_per_hour": count * 2.5 * 0.003,
                })

        # Check queue depth per model
        for model_id, depths in self.queue_depths.items():
            recent = depths[-10:]
            if recent and sum(recent) / len(recent) > 50:
                recommendations.append({
                    "action": "scale_up",
                    "reason": f"Model {model_id} avg queue depth above 50",
                    "target_model": model_id,
                    "suggested_additional_gpus": 1,
                })

        # Check model CPU cache hit rate
        stats = self.registry.get_stats()
        if stats["cache_hit_rate"] < 0.7:
            recommendations.append({
                "action": "increase_cpu_cache",
                "reason": f"Cache hit rate {stats['cache_hit_rate']:.1%} below 70%",
                "suggested_increase_gb": 100,
            })

        return recommendations

Autoscaling Triggers

# Autoscaling policy (simplified):
autoscale_config = {
    "scale_up_triggers": [
        {
            "metric": "p99_latency_ms",
            "threshold": 200,
            "window_minutes": 5,
            "action": "add_gpu_for_model",
        },
        {
            "metric": "switch_rate_per_gpu_per_hour",
            "threshold": 30,
            "window_minutes": 60,
            "action": "add_dedicated_gpu",
        },
        {
            "metric": "adapter_cache_miss_rate",
            "threshold": 0.3,
            "window_minutes": 15,
            "action": "increase_adapter_cache",
        },
    ],
    "scale_down_triggers": [
        {
            "metric": "gpu_utilization_percent",
            "threshold": 20,
            "window_minutes": 30,
            "action": "consolidate_models",
        },
    ],
    "cooldown_minutes": 10,
}
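A minimal evaluator for a policy dict of this shape might look like the following sketch; evaluate_triggers and the metrics-dict format are assumptions for illustration, not part of any framework:

```python
import time

def evaluate_triggers(config, metrics, last_action_time, now=None):
    """Return the actions whose thresholds are breached, honoring cooldown.

    metrics maps metric name -> current windowed value, e.g.
    {"p99_latency_ms": 250, "gpu_utilization_percent": 55}.
    """
    now = now if now is not None else time.time()
    if now - last_action_time < config["cooldown_minutes"] * 60:
        return []  # still cooling down from the previous scaling action

    actions = []
    # Scale-up triggers fire when the metric exceeds its threshold
    for trigger in config["scale_up_triggers"]:
        value = metrics.get(trigger["metric"])
        if value is not None and value > trigger["threshold"]:
            actions.append(trigger["action"])
    # Scale-down triggers fire when the metric falls below its threshold
    for trigger in config["scale_down_triggers"]:
        value = metrics.get(trigger["metric"])
        if value is not None and value < trigger["threshold"]:
            actions.append(trigger["action"])
    return actions
```

With the config above, a p99 latency of 250ms would produce ["add_gpu_for_model"], while any metrics arriving inside the 10-minute cooldown produce no action. A production version would also aggregate each metric over its window_minutes rather than trusting a point-in-time value.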