Part 9 of 23 in the Inference Optimization Timeline series

Fine-tuning gives each customer a model that speaks their language, follows their conventions, and produces output tailored to their domain. But the economics of serving fine-tuned models are brutal. If you have 1,000 customers, each with their own fine-tuned 70B model, you need 1,000 copies of the model weights — roughly 140 TB of GPU memory at FP16. That is absurd. Nobody does this.

LoRA (Low-Rank Adaptation) changes the equation entirely. Instead of modifying all 70 billion parameters, LoRA trains a tiny adapter — a pair of low-rank matrices — that modifies the model’s behavior while leaving the base weights untouched. The adapter for a 70B model might be 50-200 MB instead of 140 GB. One thousand adapters fit in 50-200 GB of storage, not 140 TB. But serving these adapters efficiently at inference time introduces a new set of systems challenges that are distinct from the training problem.

This post covers the full serving story: the LoRA math and why low-rank works, QLoRA’s memory optimization, the multi-adapter serving challenge, S-LoRA’s architecture for serving thousands of adapters, when to merge adapters into the base model, rank selection tradeoffs, vLLM and SGLang multi-LoRA support, and production deployment patterns.


1. LoRA: Low-Rank Adaptation

The Core Math

Standard fine-tuning updates the full weight matrix $W \in \mathbb{R}^{d \times d}$ (or $d \times k$ for non-square layers). LoRA constrains the update to a low-rank decomposition:

$$W' = W + \Delta W = W + BA$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$, with rank $r \ll d$. Only $A$ and $B$ are trained; $W$ is frozen.

The parameter count comparison is dramatic. For a weight matrix in Llama-70B with $d = 8192$:

  • Full fine-tuning: $d \times d = 8192^2 = 67{,}108{,}864$ parameters per matrix
  • LoRA with $r = 16$: $(d \times r) + (r \times d) = 2 \times 8192 \times 16 = 262{,}144$ parameters per matrix

That is a 256x reduction in trainable parameters per adapted layer. Across the full model with LoRA applied to the QKV and output projection matrices in each attention layer:

$$\text{LoRA parameters} = 4 \times 2 \times d \times r \times L$$

For Llama-70B ($d = 8192$, $L = 80$ layers, $r = 16$):

$$\text{LoRA parameters} = 4 \times 2 \times 8192 \times 16 \times 80 = 83{,}886{,}080 \approx 84\text{M parameters}$$

At FP16, that is ~168 MB per adapter, compared to ~140 GB for the full model. You can store 833 adapters in the same memory as one full model copy.
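The adapter arithmetic above is easy to sanity-check in code (a quick sketch; the factor of 4 assumes LoRA on only the Q, K, V, and O projections, per the formula above):

```python
def lora_param_count(d: int, r: int, layers: int, adapted_per_layer: int = 4) -> int:
    """Each adapted d x d matrix adds a B (d x r) and an A (r x d) pair."""
    return adapted_per_layer * 2 * d * r * layers

params = lora_param_count(d=8192, r=16, layers=80)
print(params)                   # 83886080
print(round(params * 2 / 1e6))  # 168 (MB at 2 bytes per FP16 parameter)
```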

📊 Parameter and Memory Comparison: Full Fine-Tune vs. LoRA

| Method | Trainable Params | Adapter Size (FP16) | Adapters per 140 GB | Training Memory |
|---|---|---|---|---|
| Full fine-tune (70B) | 70B | 140 GB | 1 | ~280 GB (AdamW states) |
| LoRA r=4 (70B) | 21M | 42 MB | 3,333 | ~42 GB |
| LoRA r=16 (70B) | 84M | 168 MB | 833 | ~44 GB |
| LoRA r=64 (70B) | 336M | 672 MB | 208 | ~50 GB |
| LoRA r=256 (70B) | 1.34B | 2.68 GB | 52 | ~65 GB |

Note: Adapter size = trainable parameters x 2 bytes (FP16). Training memory includes base model (FP16) + optimizer states for LoRA params only.

Why Low-Rank Works: The Intrinsic Dimensionality Hypothesis

LoRA’s effectiveness seems surprising. How can a rank-16 update — which can only express a 16-dimensional subspace of changes — capture the difference between a general-purpose model and a domain-specific one?

The answer lies in a key empirical observation: the weight updates during fine-tuning occupy a low-dimensional subspace. Aghajanyan et al. (2020) demonstrated that pre-trained language models have a low “intrinsic dimensionality” — the optimization landscape for fine-tuning is effectively much lower-dimensional than the parameter count suggests.

Concretely, if you take the full fine-tuning update $\Delta W_{\text{full}} = W_{\text{fine-tuned}} - W_{\text{pre-trained}}$ and compute its singular value decomposition:

$$\Delta W_{\text{full}} = U \Sigma V^T$$

the singular values $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_d$ decay rapidly. The top 16-64 singular values capture 90-99% of the total variance (measured by $\sum_i \sigma_i^2$). The fine-tuning update is empirically low-rank.

This means $BA$ with $r = 16$ can approximate $\Delta W_{\text{full}}$ well. The training process finds the rank-$r$ approximation that best fits the task, without needing to explicitly compute the SVD of the full update.
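This low-rank structure is easy to see numerically: plant a rank-16 signal, add small dense noise, and measure how much spectral energy the top singular values carry (an illustrative sketch on synthetic data, not a real fine-tuning delta):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 16

# Synthetic "fine-tuning update": planted rank-r signal plus small dense noise
delta_w = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
delta_w += 0.01 * rng.standard_normal((d, d))

sigma = np.linalg.svd(delta_w, compute_uv=False)
energy = np.cumsum(sigma**2) / np.sum(sigma**2)
print(f"top-{r} singular values capture {energy[r - 1]:.1%} of the variance")
```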

ℹ️ Initialization Matters

LoRA initializes $A$ with a random Gaussian and $B$ with zeros, so that $\Delta W = BA = 0$ at the start of training. This ensures the adapted model begins identical to the base model and gradually learns the task-specific update. The scaling factor $\alpha / r$ controls the magnitude of the update, where $\alpha$ is a hyperparameter (typically set equal to $r$ or $2r$).
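A minimal numpy sketch of that initialization (framework-free to stay self-contained; the dimensions are illustrative): with B zeroed out, the adapted layer reproduces the base layer exactly at step zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 16, 16

W = rng.standard_normal((d_out, d_in))     # frozen base weight
A = 0.01 * rng.standard_normal((r, d_in))  # Gaussian init
B = np.zeros((d_out, r))                   # zero init, so BA = 0

def lora_forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
assert np.allclose(lora_forward(x), W @ x)  # identical to base before training
```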

LoRA During Inference: The Compute Cost

During inference, the adapted forward pass for a linear layer becomes:

$$y = W'x = (W + BA)x = Wx + BAx$$

The extra compute for the LoRA adapter is:

  1. $Ax$: matrix-vector multiply, an $r \times d$ matrix times a $d$-vector = $2rd$ FLOPs
  2. $B(Ax)$: matrix-vector multiply, a $d \times r$ matrix times an $r$-vector = $2dr$ FLOPs

Total extra: $4rd$ FLOPs per token per adapted layer (counting each multiply-add as 2 FLOPs, consistent with the base layer below). For $r = 16$ and $d = 8192$:

$$\text{Extra FLOPs per layer} = 4 \times 16 \times 8192 = 524{,}288$$

Compare this to the base layer cost of $2d^2 = 2 \times 8192^2 = 134{,}217{,}728$ FLOPs. The LoRA overhead is:

$$\frac{4rd}{2d^2} = \frac{2r}{d} = \frac{32}{8192} \approx 0.4\%$$

LoRA Inference Overhead by Rank (Llama-70B, per layer)

| Rank | Extra FLOPs | Assessment |
|---|---|---|
| r=4 | 0.1% | Negligible |
| r=16 | 0.4% | Negligible |
| r=64 | 1.6% | Minimal |
| r=128 | 3.1% | Small |
| r=256 | 6.3% | Noticeable |
| r=512 | 12.5% | Significant |

At rank 16, the LoRA overhead is less than a rounding error in wall-clock time. Even at rank 256, the overhead is only about 6% — small relative to other system-level variations. This is what makes LoRA practical for serving: the compute cost of personalization is nearly zero.


2. QLoRA: Quantized Base + Full-Precision Adapters

QLoRA (Dettmers et al., 2023) extends LoRA by quantizing the base model to 4-bit precision while keeping the LoRA adapters in FP16 (or BFloat16). This drastically reduces the memory required for the base model, enabling fine-tuning and serving of larger models on fewer GPUs.

Memory Arithmetic

For Llama-70B:

  • FP16 base model: 70B parameters x 2 bytes = 140 GB
  • 4-bit quantized base (NF4): 70B parameters x 0.5 bytes = 35 GB
  • LoRA adapters (FP16, r=16): ~84M parameters x 2 bytes = ~168 MB
  • Total QLoRA serving: 35 GB + 168 MB = ~35.2 GB

This fits on a single A100-80GB with 45 GB remaining for KV cache and prefix caching — versus 140 GB for the FP16 model which requires at least 2 GPUs.

GPU Memory Layout: QLoRA Serving (Llama-70B, A100-80GB)

| Region | Size | Contents |
|---|---|---|
| 4-bit base model (NF4) | 35 GB | 70B parameters in NormalFloat4 quantization |
| LoRA adapters (FP16) | 168 MB | Active adapter, r=16, applied to QKV+O projections |
| Adapter pool (hot) | 1-5 GB | Next most popular adapters, ready for swapping |
| KV cache + prefix cache | ~40 GB | Active request KV storage + cached prefixes |
| Remaining | ~3 GB | CUDA overhead, activations, scheduling buffers |

NF4 Quantization

QLoRA uses NormalFloat4 (NF4) quantization, which is information-theoretically optimal for normally distributed weights. The key insight is that pre-trained neural network weights are approximately normally distributed, so a uniform quantization grid wastes precision in the tails. NF4 spaces quantization levels according to the normal distribution’s quantiles, placing more levels near zero (where most weights cluster) and fewer in the tails.

The quantization process:

  1. Normalize: divide weights by their absmax value within a block (typically 64 elements)
  2. Map: find the nearest NF4 quantization level (16 levels for 4-bit)
  3. Store: 4 bits per weight + one FP32 absmax scale per block

The block-wise scaling adds overhead: one FP32 (4 bytes) per 64 weights = 0.0625 bytes/weight extra. Total storage per weight: 0.5 + 0.0625 = 0.5625 bytes. For 70B parameters: ~39.4 GB. In practice, QLoRA also uses “double quantization” (quantizing the block scales themselves to FP8), reducing the overhead further.
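The three-step recipe can be sketched in numpy (an illustrative toy: the 16 levels below are evenly spaced quantiles of a sampled N(0, 1), normalized to [-1, 1], not the exact published NF4 codebook):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 16-level codebook from normal quantiles, normalized to [-1, 1]
samples = np.sort(rng.standard_normal(1_000_000))
levels = samples[(np.linspace(0.02, 0.98, 16) * len(samples)).astype(int)]
levels = levels / np.abs(levels).max()

def quantize_blockwise(w, block=64):
    w = w.reshape(-1, block)
    scales = np.abs(w).max(axis=1, keepdims=True)        # 1. per-block absmax
    normed = w / scales                                   # 1. normalize
    idx = np.abs(normed[..., None] - levels).argmin(-1)   # 2. nearest level
    return idx.astype(np.uint8), scales                   # 3. 4-bit idx + FP32 scale

def dequantize_blockwise(idx, scales):
    return (levels[idx] * scales).ravel()

w = rng.standard_normal(64 * 256).astype(np.float32)
idx, scales = quantize_blockwise(w)
w_hat = dequantize_blockwise(idx, scales)
print(f"mean |w - w_hat|: {np.abs(w - w_hat).mean():.3f}")
```

Storage works out as in the text: 0.5 bytes for the 4-bit index plus 4 bytes of scale amortized over each 64-element block.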

QLoRA Inference Quality

The quality impact of 4-bit quantization depends on the task and the model size. Larger models are more robust to quantization because each individual weight contributes less to the output.

📊 Quality Impact: FP16 vs. QLoRA 4-bit (Llama-70B)

| Benchmark | FP16 Base | 4-bit NF4 Base | FP16 + LoRA r=16 | 4-bit + LoRA r=16 |
|---|---|---|---|---|
| MMLU (5-shot) | 68.9 | 68.2 (-0.7) | 72.1 | 71.5 (-0.6) |
| HellaSwag | 87.3 | 86.8 (-0.5) | 88.5 | 88.0 (-0.5) |
| HumanEval | 32.9 | 31.4 (-1.5) | 42.1 | 40.8 (-1.3) |
| GSM8K | 56.8 | 54.2 (-2.6) | 64.3 | 62.1 (-2.2) |
| Average degradation | - | -1.3% | - | -1.2% |

Note: LoRA fine-tuned on domain-specific data. Quality loss is relative to FP16 counterpart.

The average quality degradation from 4-bit quantization is ~1.3%, which is acceptable for most production use cases. For tasks requiring maximum quality (e.g., medical or legal applications), FP16 serving may still be preferred, at the cost of 4x more GPU memory.

The QLoRA Sweet Spot

QLoRA’s biggest impact is on total cost of ownership: serving a 70B model on 1 GPU instead of 2-4 GPUs cuts hardware costs by 50-75%. The 1-2% quality degradation from 4-bit quantization is a cheap price for halving your GPU fleet. For most production workloads, QLoRA dominates FP16 on cost-per-quality.


3. The Multi-Adapter Serving Challenge

The LoRA story so far is clean: tiny adapters, negligible compute overhead, massive memory savings. But production deployments introduce a systems problem that pure math cannot solve.

The Problem: 1,000 Customers, 1,000 Adapters

Consider an enterprise LLM platform serving 1,000 customers. Each customer has fine-tuned a LoRA adapter for their domain: legal documents, medical records, financial reports, customer support, code generation, and so on. Each request arrives with a customer ID that maps to a specific adapter.

Approach 1: Merge each adapter into a separate base model copy.

$W'_i = W + B_i A_i$ for customer $i$. Now you have 1,000 copies of the 70B model, each slightly different. At FP16, that is 1,000 x 140 GB = 140 TB of model weights. You need ~1,750 A100-80GB GPUs just for model weights, even before KV cache. This is economically absurd.

Approach 2: Load/unload adapters per request.

Keep one base model and swap LoRA adapters for each request. But adapter swapping requires copying 168 MB from CPU to GPU memory per swap. At PCIe 4.0 (32 GB/s), that is ~5 ms per swap. If consecutive requests use different adapters, you spend 5 ms swapping for every request. At 1,000 QPS, that is 5 seconds of PCIe bandwidth consumed per second — the bus is saturated, and you add 5 ms to every request’s latency.

Approach 3: Keep all adapters in GPU memory.

1,000 adapters x 168 MB = 168 GB. That does not fit in one A100-80GB, or even two. You need at least 3 GPUs just for adapter storage, plus GPUs for the base model and KV cache.

None of these approaches scale. We need something better.

Adapter Management Overhead per Request

| Adapter location | Overhead | Notes |
|---|---|---|
| Pre-merged | 0 ms | Zero overhead, but 140 TB total |
| GPU adapter pool | 0.1 ms | Adapter already resident on GPU |
| CPU DRAM | 5 ms | PCIe 4.0 transfer |
| SSD | 25 ms | NVMe load |

4. S-LoRA: Scalable Multi-Adapter Serving

S-LoRA (Sheng et al., 2023) solves the multi-adapter serving problem with three key innovations: a shared base model, unified paging for adapter memory, and adapter-aware batching.

Architecture Overview

S-LoRA maintains a single copy of the base model in GPU HBM and stores adapters in a memory pool that spans GPU and CPU memory. The key ideas:

  1. Shared base model: all requests, regardless of adapter, share the same base model weights. The forward pass through $W$ is identical; only the $BAx$ addend differs per adapter.

  2. Adapter memory pool: adapters are stored in a paged memory pool, similar to how PagedAttention manages KV cache. Hot adapters (frequently used) reside in GPU memory. Cold adapters are in CPU DRAM or SSD. The system dynamically promotes and demotes adapters based on access patterns.

  3. Unified paging: both KV cache blocks and adapter weight blocks are managed by the same paging system. This allows the scheduler to make global decisions about memory allocation: if a popular adapter is consuming GPU memory, it may be worth evicting some KV cache entries to keep the adapter hot.

class SLoRAServer:
    def __init__(self, base_model, gpu_memory_budget, cpu_memory_budget):
        self.base_model = base_model  # Single copy, shared across all requests
        self.adapter_pool = AdapterPool(
            gpu_budget=gpu_memory_budget * 0.15,  # 15% for adapters
            cpu_budget=cpu_memory_budget * 0.5,
        )
        self.kv_pool = KVCachePool(
            gpu_budget=gpu_memory_budget * 0.55,  # 55% for KV cache
        )
        # Remaining 30% for base model weights + CUDA overhead

    def serve_request(self, request):
        adapter_id = request.adapter_id

        # Ensure adapter is in GPU memory
        adapter = self.adapter_pool.get_or_promote(adapter_id)

        # Allocate KV cache for this request
        kv_blocks = self.kv_pool.allocate(request.max_seq_len)

        # Forward pass: base model + adapter
        output = self.forward_with_adapter(request.tokens, adapter, kv_blocks)
        return output

Batched Multi-Adapter Forward Pass

The most technically interesting aspect of S-LoRA is how it handles batched inference with multiple adapters. In a continuous batching system, a single decode iteration may process 32 requests, each potentially using a different adapter.

The naive approach: run the base model forward pass for all 32 requests, then run 32 separate adapter forward passes (one per request). This is inefficient because each adapter forward pass is a tiny matrix operation that cannot saturate the GPU.

S-LoRA’s approach: batch the adapter computations using custom CUDA kernels.

For a batch of $N$ requests with adapters $\{(B_1, A_1), (B_2, A_2), \ldots, (B_N, A_N)\}$, the computation is:

$$y_i = Wx_i + B_i A_i x_i \quad \text{for } i = 1, \ldots, N$$

The base computation WxiWx_i is a standard batched GEMM. The adapter computation BiAixiB_i A_i x_i requires a grouped GEMM (also called batched GEMM with variable matrices), where each element in the batch uses a different set of A and B matrices.

import torch
from collections import defaultdict

def batched_adapter_forward(x_batch, base_weight, adapters, adapter_indices):
    """Efficient batched forward pass with multiple LoRA adapters.

    x_batch: [batch_size, d_in] - input activations
    base_weight: [d_out, d_in] - shared base model weight
    adapters: dict mapping adapter_id -> (B, A) matrices
    adapter_indices: [batch_size] - which adapter each request uses
    """
    # Step 1: Base computation (single batched GEMM, shared across all requests)
    base_output = x_batch @ base_weight.T  # [batch_size, d_out]

    # Step 2: Group requests by adapter for efficient batched computation
    grouped = defaultdict(list)
    for i, adapter_id in enumerate(adapter_indices):
        grouped[adapter_id].append(i)

    # Step 3: Compute adapter contributions per group
    adapter_output = torch.zeros_like(base_output)
    for adapter_id, indices in grouped.items():
        B, A = adapters[adapter_id]
        x_group = x_batch[indices]  # [group_size, d_in]

        # Two small GEMMs per adapter group
        hidden = x_group @ A.T     # [group_size, r]
        delta = hidden @ B.T       # [group_size, d_out]
        adapter_output[indices] = delta

    return base_output + adapter_output

In practice, S-LoRA uses CUDA kernels that fuse the grouped operations and use shared memory to avoid redundant data movement. The key optimization is that requests using the same adapter within a batch can share the adapter weight loads, amortizing the memory traffic.

Adapter-Aware Scheduling

The scheduler in S-LoRA prioritizes batching requests that use the same adapter together. This maximizes adapter weight reuse within a batch and minimizes the number of adapter swaps.
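A toy sketch of this grouping policy (simplified: real S-LoRA scheduling also weighs fairness, memory pressure, and swap state; `form_batch`, `max_batch`, and the request dicts are illustrative names, not S-LoRA's API):

```python
from collections import defaultdict

def form_batch(queue, max_batch=32):
    """Greedily fill a batch from the largest same-adapter groups first,
    maximizing adapter weight reuse within the batch."""
    by_adapter = defaultdict(list)
    for req in queue:
        by_adapter[req["adapter_id"]].append(req)

    batch = []
    for adapter_id in sorted(by_adapter, key=lambda a: -len(by_adapter[a])):
        batch.extend(by_adapter[adapter_id][: max_batch - len(batch)])
        if len(batch) == max_batch:
            break
    return batch

queue = [{"id": i, "adapter_id": f"a{i % 3}"} for i in range(10)]
batch = form_batch(queue, max_batch=4)  # filled entirely from adapter "a0"
```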

📊 S-LoRA Throughput by Number of Active Adapters (Llama-70B, A100-80GB)

| Active Adapters | Throughput (req/s) | Avg Adapter Swap Time | GPU Memory for Adapters | Base Model Overhead |
|---|---|---|---|---|
| 1 (single adapter) | 580 | 0 ms | 168 MB | 0% |
| 10 | 560 | 0.1 ms | 1.7 GB | ~1% |
| 100 | 510 | 0.5 ms | 3.2 GB (pool) + CPU | ~3% |
| 1,000 | 440 | 1.2 ms | 5 GB (pool) + CPU | ~8% |
| 10,000 | 350 | 2.8 ms | 5 GB (pool) + CPU | ~15% |

Note: r=16 adapters. Adapter pool holds ~30 adapters in GPU, rest in CPU DRAM. Throughput includes adapter swap overhead.

With 1,000 active adapters, S-LoRA maintains 76% of the single-adapter throughput. The overhead comes from adapter swapping (moving adapters between CPU and GPU) and reduced batching efficiency (smaller groups of same-adapter requests in each batch).

ℹ️ The Zipf Distribution Helps

In practice, adapter access patterns follow a Zipf distribution: a few adapters are extremely popular (large enterprise customers), while most adapters are accessed rarely. This means the adapter pool in GPU memory (sized for ~30 adapters) captures the vast majority of traffic. If the top 30 adapters handle 80% of requests, only 20% of requests incur adapter swap overhead.
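Coverage under the stated Zipf assumption is a one-liner to estimate (a sketch; note that at exponent 1.0 the top 30 of 1,000 adapters cover only about half of the traffic, so the 80% figure above corresponds to a steeper skew than alpha = 1):

```python
def zipf_coverage(top_k: int, n_adapters: int, alpha: float = 1.0) -> float:
    """Fraction of traffic hitting the top_k most popular adapters
    under Zipf(alpha)-distributed adapter popularity."""
    weights = [1 / rank ** alpha for rank in range(1, n_adapters + 1)]
    return sum(weights[:top_k]) / sum(weights)

print(f"top  30: {zipf_coverage(30, 1000):.0%}")
print(f"top 200: {zipf_coverage(200, 1000):.0%}")
```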


5. Adapter Merging: When and How

If a single adapter serves all traffic (or a large fraction of it), you can merge the adapter into the base model, eliminating the adapter overhead entirely.

The Merge Operation

Merging is a simple addition:

$$W'_{\text{merged}} = W + \frac{\alpha}{r} BA$$

where $\alpha$ is the LoRA scaling factor. After merging, the model has no adapter — it is a standard model with modified weights. The forward pass is exactly $y = W'x$, with zero adapter overhead.

import torch

def merge_adapter(base_weight: torch.Tensor, lora_A: torch.Tensor,
                  lora_B: torch.Tensor, alpha: float, rank: int) -> torch.Tensor:
    """Merge a LoRA adapter into the base weight matrix.

    Returns the merged weight with the adapter baked in.
    """
    scaling = alpha / rank
    merged = base_weight + scaling * (lora_B @ lora_A)
    return merged

def unmerge_adapter(merged_weight: torch.Tensor, lora_A: torch.Tensor,
                    lora_B: torch.Tensor, alpha: float, rank: int) -> torch.Tensor:
    """Reverse the merge to recover the base weight.

    Useful for switching between merged and unmerged modes.
    """
    scaling = alpha / rank
    base = merged_weight - scaling * (lora_B @ lora_A)
    return base

When to Merge

Merge when:

  • A single adapter handles 100% of traffic (single-tenant deployment)
  • The adapter is stable and will not be updated frequently
  • You need maximum throughput with zero adapter overhead
  • The deployment is latency-sensitive and even a fraction of a percent of overhead matters

Keep separate when:

  • Multiple adapters serve different customers (multi-tenant)
  • Adapters are updated frequently (daily or weekly retraining)
  • You need to A/B test adapter versions
  • You want to share the base model across adapters for memory efficiency

The Hybrid Approach

Production systems often use a hybrid: merge the most popular adapter into the base model and keep other adapters separate. This gives zero overhead for the majority of traffic while maintaining flexibility for the long tail.

📊 Merge vs. Separate: Serving Performance (Llama-70B, A100-80GB)

| Configuration | Throughput (req/s) | Latency Overhead | Memory Usage | Flexibility |
|---|---|---|---|---|
| Merged (single adapter) | 610 | 0% | 140 GB (FP16) | None -- one model |
| Separate (single adapter, GPU) | 595 | ~0.2% | 140 GB + 168 MB | Can swap adapters |
| Merged primary + 10 separate | 580 | ~1% (10% of requests) | 140 GB + 1.7 GB | Good balance |
| All separate (1,000 adapters) | 440 | ~8% | 35 GB + 5 GB pool | Maximum flexibility |

Note: The 'merged primary' row assumes 90% of traffic uses the merged adapter. 4-bit base model used for the 1,000-adapter case.

The “merged primary + separate secondaries” configuration achieves 95% of merged throughput while retaining the ability to serve multiple adapters. This is the pattern most production systems converge on.


6. LoRA Rank Tradeoffs

The rank $r$ is the single most important hyperparameter in LoRA. It controls the capacity of the adapter (how much the model can change), the memory footprint, and the inference overhead.

Quality vs. Rank

Higher rank allows the adapter to capture more complex modifications, but with diminishing returns. The relationship between rank and quality follows a log-like curve: doubling the rank from 4 to 8 gives a significant quality boost, but doubling from 64 to 128 gives a marginal one.

Task Quality vs. LoRA Rank (Llama-70B, domain-specific fine-tuning)

| Rank | Accuracy | Assessment |
|---|---|---|
| r=1 | 68% | Underfitting |
| r=4 | 78% | Acceptable |
| r=8 | 83% | Good |
| r=16 | 86% | Strong |
| r=32 | 87% | Diminishing returns |
| r=64 | 88% | Near ceiling |
| r=128 | 88% | Marginal gain |

The “knee” of the curve is typically between $r = 8$ and $r = 32$, depending on the task complexity:

  • Simple style adaptation (tone, formatting): $r = 4$-$8$ is sufficient
  • Domain knowledge injection (medical, legal, financial): $r = 16$-$32$ is typical
  • Complex task adaptation (code generation, math reasoning): $r = 32$-$64$ may be needed
  • Near full fine-tuning quality: $r = 128$-$256$, but at this point, consider full fine-tuning

Serving Cost vs. Rank

The serving cost of LoRA is dominated by memory, not compute. The compute overhead is negligible at any practical rank. But the memory cost scales linearly:

📊 LoRA Rank: Quality, Memory, and Serving Tradeoffs (Llama-70B)

| Rank | Adapter Size (FP16) | Adapters in 10 GB GPU Pool | Compute Overhead | Typical Quality |
|---|---|---|---|---|
| r=4 | 42 MB | 238 | 0.1% | Style/tone only |
| r=8 | 84 MB | 119 | 0.2% | Acceptable |
| r=16 | 168 MB | 59 | 0.4% | Good (common choice) |
| r=32 | 336 MB | 29 | 0.8% | Strong |
| r=64 | 672 MB | 14 | 1.6% | Near full FT |
| r=128 | 1.34 GB | 7 | 3.1% | Diminishing returns |

Note: Adapter pool of 10 GB on GPU. Higher rank = fewer adapters cached on GPU = more swapping.

The practical implication: at $r = 16$, you can keep 59 adapters hot in a 10 GB GPU pool. At $r = 64$, only 14 fit. If you have 1,000 adapters with Zipfian access patterns, the top 59 adapters (at $r = 16$) might cover 90% of traffic, while the top 14 adapters (at $r = 64$) might cover only 60%. This means higher rank not only costs more memory but also increases the adapter swap rate, compounding the overhead.

💡 Rank Selection Heuristic

Start with $r = 16$. Train for your target task and evaluate. If quality is insufficient, double to $r = 32$. If quality at $r = 16$ matches $r = 32$ within noise, drop to $r = 8$. The goal is the lowest rank that meets your quality bar — every halving of rank doubles the number of adapters you can cache.

Rank and Quantization Interaction

An interesting tradeoff emerges when combining QLoRA (4-bit base) with different LoRA ranks. The quality loss from 4-bit quantization can be partially recovered by increasing the LoRA rank:

$$\text{Quality}(4\text{-bit}, r=32) \approx \text{Quality}(16\text{-bit}, r=16)$$

This suggests a strategy: use aggressive base model quantization (4-bit) with a higher LoRA rank ($r = 32$) instead of milder quantization (8-bit) with a lower rank ($r = 16$). The 4-bit + $r = 32$ configuration uses less total memory while achieving comparable quality.
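The memory claim is straightforward arithmetic (a sketch assuming a 70B model, FP16 adapters on QKV+O, and ignoring quantization-scale overhead):

```python
def serving_gb(params_b: float, base_bits: int, rank: int,
               d: int = 8192, layers: int = 80) -> float:
    """Base model at base_bits precision plus an FP16 LoRA adapter."""
    base = params_b * 1e9 * base_bits / 8 / 1e9       # GB for base weights
    adapter = 4 * 2 * d * rank * layers * 2 / 1e9     # GB for FP16 adapter
    return base + adapter

print(f"4-bit base + r=32: {serving_gb(70, 4, 32):.1f} GB")
print(f"8-bit base + r=16: {serving_gb(70, 8, 16):.1f} GB")
```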


7. vLLM and SGLang Multi-LoRA Support

Both major serving frameworks now support multi-LoRA inference, though with different architectures and performance characteristics.

vLLM Multi-LoRA

vLLM supports multi-LoRA serving through its --enable-lora flag. The key design choices:

  • Adapter storage: adapters are stored in CPU memory and loaded to GPU on demand
  • Max concurrent adapters: configurable via --max-loras (how many can be active on GPU simultaneously)
  • LoRA request routing: the OpenAI-compatible API accepts a model parameter that maps to a specific adapter
  • Batching: requests with different adapters can share a batch, but the adapter forward pass is serialized per-adapter group
# vLLM multi-LoRA serving
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-hf \
    --enable-lora \
    --lora-modules customer-a=./adapters/customer_a \
                   customer-b=./adapters/customer_b \
                   customer-c=./adapters/customer_c \
    --max-loras 4 \
    --max-lora-rank 64 \
    --max-cpu-loras 100

The --max-loras parameter controls GPU-resident adapter slots. With max-loras=4, up to 4 adapters reside in GPU memory simultaneously. Requests for a fifth adapter trigger an eviction (LRU) and a CPU-to-GPU transfer.
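From the client side, adapter selection rides on the `model` field of the OpenAI-compatible request: a registered LoRA name selects that adapter, while the base model name selects none. A sketch of the payload (the adapter name matches the illustrative `--lora-modules` registration above):

```python
import json

payload = {
    "model": "customer-a",  # LoRA module name registered via --lora-modules
    "prompt": "Summarize the attached contract clause.",
    "max_tokens": 256,
}
print(json.dumps(payload, indent=2))
```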

SGLang Multi-LoRA

SGLang’s multi-LoRA support integrates with its RadixAttention prefix caching. The combination is powerful: the system can cache the base model’s KV for a shared system prompt and apply different adapters to different requests, sharing the prefill work but personalizing the output.

SGLang uses a similar adapter pool design but benefits from its more aggressive scheduling and memory management.

Performance Comparison

📊

Multi-LoRA Serving: vLLM vs. SGLang (Llama-70B, A100-80GB)

ScenariovLLM ThroughputSGLang ThroughputvLLM TTFTSGLang TTFT
Single adapter 580 req/s 610 req/s 12 ms 10 ms
10 adapters (uniform) 520 req/s 555 req/s 18 ms 15 ms
100 adapters (Zipf) 440 req/s 490 req/s 25 ms 20 ms
1,000 adapters (Zipf) 350 req/s 400 req/s 38 ms 30 ms
Note: r=16, 4-bit base model, ShareGPT conversation distribution. Zipf access pattern with alpha=1.0.

SGLang consistently outperforms vLLM in multi-LoRA scenarios, primarily due to better scheduling and the interaction between RadixAttention prefix caching and adapter management. The gap widens with more adapters because SGLang’s scheduler is more effective at grouping same-adapter requests.


8. Production Patterns

Pattern 1: Single Base + Thousands of Customer Adapters

This is the most common enterprise pattern. A single base model (typically 4-bit quantized) serves all customers. Each customer has a LoRA adapter stored on SSD. The adapter pool in GPU memory holds the top 20-50 adapters, covering 80-95% of traffic.

Architecture:

  • 4-bit base model on GPU (~35 GB for 70B)
  • GPU adapter pool: 5-10 GB (30-60 adapters at $r = 16$)
  • CPU adapter pool: 32-64 GB (hundreds of adapters)
  • SSD adapter storage: 1+ TB (all adapters)
  • KV cache + prefix cache: remaining GPU memory

Request flow:

  1. Request arrives with customer ID
  2. Router checks cache-aware registry for adapter location
  3. If adapter is on GPU: proceed directly to inference
  4. If adapter is on CPU: promote to GPU (5 ms), then inference
  5. If adapter is on SSD: load to CPU then GPU (25-50 ms), then inference

Request Latency by Adapter Cache Tier

| Tier | Traffic Share | Latency |
|---|---|---|
| GPU hit (top 30) | ~85% of requests | 12 ms |
| CPU hit (top 200) | ~12% of requests | 18 ms |
| SSD load (cold) | ~3% of requests | 55 ms |
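The tier mix implies an expected per-request latency as a traffic-weighted average (a quick sketch using the figures above):

```python
tiers = {            # tier -> (traffic share, latency in ms)
    "gpu_hit": (0.85, 12),
    "cpu_hit": (0.12, 18),
    "ssd_load": (0.03, 55),
}
expected_ms = sum(share * ms for share, ms in tiers.values())
print(f"expected latency: {expected_ms:.2f} ms")  # ~14 ms, dominated by GPU hits
```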

Pattern 2: Merged Primary + Adapter Overrides

For platforms where 80%+ of traffic uses a single “default” model, merge the primary adapter into the base weights and keep specialty adapters separate.

Architecture:

  • Merged base model (FP16, primary adapter baked in): 140 GB across 2 GPUs
  • Separate specialty adapters: 10-50 in GPU pool
  • Default traffic: zero adapter overhead
  • Specialty traffic: minimal LoRA overhead

This is particularly effective for platforms that have a strong default model but allow premium customers to customize.

Pattern 3: Adapter Version Management

Adapters are retrained regularly (weekly or monthly) as customer data accumulates. The serving system must handle adapter version transitions:

from typing import Dict

LoRAWeights = dict  # stand-in type for an adapter's (B, A) weight container

class AdapterVersionManager:
    def __init__(self):
        self.active_versions: Dict[str, str] = {}  # customer_id -> version_id
        self.adapter_store: Dict[str, Dict[str, LoRAWeights]] = {}  # customer_id -> {version_id -> weights}
        self.rollout_config: Dict[str, dict] = {}  # customer_id -> in-flight rollout

    def deploy_new_version(self, customer_id: str, version_id: str,
                           weights: LoRAWeights, rollout_pct: float = 0.1):
        """Gradual rollout of a new adapter version."""
        self.adapter_store.setdefault(customer_id, {})[version_id] = weights

        # Route rollout_pct of traffic to the new version;
        # remaining traffic continues with the current version
        self.rollout_config[customer_id] = {
            'new_version': version_id,
            'rollout_pct': rollout_pct,
        }

    def get_adapter(self, customer_id: str, request_id: str) -> LoRAWeights:
        """Get the adapter for a request, respecting rollout configuration."""
        if customer_id in self.rollout_config:
            config = self.rollout_config[customer_id]
            if hash(request_id) % 100 < config['rollout_pct'] * 100:
                return self.adapter_store[customer_id][config['new_version']]

        active_version = self.active_versions[customer_id]
        return self.adapter_store[customer_id][active_version]

Gradual rollout ensures that a bad adapter does not immediately affect all traffic. The system monitors quality metrics during rollout and can automatically roll back if metrics degrade.

Pattern 4: Adapter Composition

Some advanced deployments compose multiple adapters for a single request. For example, a customer might have a “domain” adapter (trained on their data) and a “style” adapter (trained for their preferred output format). The composed forward pass is:

$$y = Wx + B_{\text{domain}} A_{\text{domain}} x + B_{\text{style}} A_{\text{style}} x$$

This is mathematically equivalent to having a single adapter with $B = [B_{\text{domain}}, B_{\text{style}}]$ and $A = [A_{\text{domain}}^T, A_{\text{style}}^T]^T$, but it avoids retraining when either component changes. The serving overhead is doubled (two adapter forward passes per layer), but at $r = 16$ that means going from roughly 0.4% to 0.8% extra FLOPs — still negligible.

⚠️ Adapter Composition Is Not Guaranteed to Work

Composing independently trained adapters can produce unpredictable results. The combined update $B_1 A_1 + B_2 A_2$ may not be equivalent to an adapter trained on the union of both datasets. In practice, composition works well when the adapters are “orthogonal” (modifying different aspects of behavior) but poorly when they conflict (both trying to change the same output distribution). Always validate composed adapters on evaluation data before deployment.
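The block-concatenation equivalence is worth verifying numerically (a sketch with random matrices standing in for trained adapters):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 128, 16

B1, A1 = rng.standard_normal((d, r)), rng.standard_normal((r, d))
B2, A2 = rng.standard_normal((d, r)), rng.standard_normal((r, d))
x = rng.standard_normal(d)

# Two separate adapter passes...
separate = B1 @ (A1 @ x) + B2 @ (A2 @ x)

# ...equal a single rank-2r adapter built by concatenation
B = np.concatenate([B1, B2], axis=1)  # [d, 2r]
A = np.concatenate([A1, A2], axis=0)  # [2r, d]
combined = B @ (A @ x)

assert np.allclose(separate, combined)
```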


9. Cost Analysis: Multi-Adapter vs. Multi-Model

To close the loop, let us compare the total cost of serving 1,000 customers with personalized models.

📊 Cost Comparison: 1,000 Personalized Models (Llama-70B, 1,000 QPS Total)

| Approach | GPUs Required | Monthly Cost (A100) | TTFT P50 | Flexibility |
|---|---|---|---|---|
| 1,000 separate full models | 1,750+ A100s | $3,500,000+ | 10 ms | Maximum |
| S-LoRA (4-bit base, r=16) | 4 A100s | $8,000 | 20 ms | High |
| S-LoRA (FP16 base, r=16) | 8 A100s | $16,000 | 15 ms | High |
| 10 merged models (top 10 customers) | 20 A100s | $40,000 | 10 ms | Low |
| Merged primary + S-LoRA for rest | 6 A100s | $12,000 | 12 ms | Good |

Note: A100-80GB at ~$2/hr list price. 1,000 QPS distributed across adapters with Zipf access pattern.

The S-LoRA approach is 437x cheaper than deploying separate models. Even accounting for the slight TTFT increase and the engineering complexity, the economics are overwhelming. Multi-adapter serving with LoRA is the only practical way to offer personalized LLM models at scale.

The “merged primary + S-LoRA” hybrid offers the best balance for most production deployments: near-optimal latency for the majority of traffic (the merged primary adapter) with full flexibility for the long tail (S-LoRA for the remaining 5-20% of requests using specialty adapters).


Conclusion

LoRA transforms the economics of personalized LLM serving. The math is clean — a rank-16 update adds well under 1% compute overhead while capturing the vast majority of task-specific behavior. QLoRA extends this further by quantizing the shared base model to 4-bit, fitting a 70B model on a single GPU. S-LoRA solves the systems challenge of serving thousands of adapters efficiently, using unified paging and adapter-aware batching to maintain high throughput even with extreme adapter diversity.

The key decisions for production deployment are:

  1. Rank selection: start at $r = 16$, go lower if quality allows, higher only if needed
  2. Base quantization: use 4-bit (QLoRA) unless your quality bar demands FP16
  3. Merge vs. separate: merge the primary adapter if one dominates; keep separate for multi-tenant
  4. Cache hierarchy: size the GPU adapter pool for the top 30-50 adapters; use CPU DRAM for the next few hundred
  5. Routing: use adapter-aware routing to maximize GPU adapter cache hits

The next post in this series moves from model-level optimization to system architecture: disaggregated prefill-decode serving, where we split the inference pipeline across dedicated hardware pools.