Autoscaling a web service is a solved problem. Request rate goes up, spin up more instances. Each instance takes 2-5 seconds to boot. The scale-up latency is small relative to the traffic spike duration, so reactive scaling works. LLM inference breaks this model completely. Loading a 70B parameter model takes 30-60 seconds: read 140 GB from disk, copy to GPU memory, run warmup inference. A 405B model across 8 GPUs takes 90-180 seconds. By the time the new replica is ready, a traffic spike that started 2 minutes ago may have already subsided.
This mismatch between scaling latency and traffic dynamics makes LLM autoscaling a fundamentally different problem. A workable solution combines predictive scaling (add capacity before it is needed), warm pools (pre-loaded replicas sitting idle), fast model loading (reducing cold start from 60s to under 1s), and careful scale-down policies (do not remove capacity too quickly).
This post covers each of these topics in depth, with a production-grade autoscaler implementation at the end.
The Cold Start Problem
1.1 Anatomy of a Cold Start
When a new LLM inference replica starts, it goes through a fixed sequence of steps. Each step has a cost that scales with model size.
import time
def cold_start_timeline(model_size_gb, num_gpus=1,
disk_bw_gbps=3.0, pcie_bw_gbps=25.0,
network_bw_gbps=10.0, warmup_tokens=128):
"""
Model the cold start timeline for an LLM replica.
Returns breakdown of time per phase in seconds.
"""
# Phase 1: Container/process startup
container_start_s = 5.0 # fixed overhead
# Phase 2: Model weight loading from storage
# Source could be: local NVMe, network filesystem, S3
if num_gpus > 1:
# Tensor parallel: each GPU loads its shard
per_gpu_gb = model_size_gb / num_gpus
load_from_disk_s = per_gpu_gb / disk_bw_gbps
else:
load_from_disk_s = model_size_gb / disk_bw_gbps
# Phase 3: Weight transfer to GPU (if loading to CPU first)
gpu_transfer_s = model_size_gb / pcie_bw_gbps
# Phase 4: KV cache pre-allocation
# Typically 40-60% of remaining GPU memory
kv_alloc_s = 0.5 # CUDA malloc for pre-allocated pool
# Phase 5: CUDA graph compilation (if used)
cuda_graph_s = 3.0 * num_gpus # compile for common batch sizes
# Phase 6: Warmup inference (JIT compilation, cuBLAS tuning)
warmup_s = 2.0
total = (container_start_s + load_from_disk_s + gpu_transfer_s +
kv_alloc_s + cuda_graph_s + warmup_s)
return {
'container_start': container_start_s,
'weight_loading': load_from_disk_s,
'gpu_transfer': gpu_transfer_s,
'kv_alloc': kv_alloc_s,
'cuda_graphs': cuda_graph_s,
'warmup': warmup_s,
'total': total,
}
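As a sanity check on the model above, the 70B FP16 row of the table below can be reproduced by hand (each GPU reads its 35 GB shard at 3 GB/s, the full 140 GB crosses PCIe at 25 GB/s):

```python
# Reproduce the Llama 3 70B FP16 / 4-GPU row of the cold start table.
container_s = 5.0              # fixed container/process startup
load_s = (140 / 4) / 3.0       # each GPU reads its 35 GB shard from disk
transfer_s = 140 / 25.0        # full model crosses PCIe to GPU memory
kv_alloc_s = 0.5               # KV cache pool allocation
cuda_graph_s = 3.0 * 4         # CUDA graph capture, per GPU
warmup_s = 2.0                 # warmup inference
total = (container_s + load_s + transfer_s +
         kv_alloc_s + cuda_graph_s + warmup_s)
print(round(total, 1))  # 36.8
```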
Cold Start Breakdown by Model Size
| Model | Size (GB) | GPUs | Weight Load | GPU Transfer | CUDA Graphs | Total Cold Start |
|---|---|---|---|---|---|---|
| Llama 3 8B (FP16) | 16 | 1 | 5.3s | 0.6s | 3.0s | 16.4s |
| Llama 3 70B (FP16) | 140 | 4 (TP) | 11.7s | 5.6s | 12.0s | 36.8s |
| Llama 3 70B (INT8) | 70 | 2 (TP) | 11.7s | 2.8s | 6.0s | 28.0s |
| Llama 3 405B (FP16) | 810 | 8 (TP) | 33.8s | 32.4s | 24.0s | 97.7s |
| Mixtral 8x22B (FP16) | 268 | 4 (TP) | 22.3s | 10.7s | 12.0s | 52.5s |
1.2 Cold Start vs Traffic Dynamics
The problem becomes clear when you compare cold start duration to typical traffic spike characteristics.
Cold Start Duration vs Traffic Spike Duration (seconds)
A 30-second API traffic spike is over before a 70B replica finishes loading. Reactive autoscaling is useless for this scenario. The replica comes online just as load returns to baseline, and then scale-down removes it 5 minutes later, wasting 6 minutes of GPU-hours for zero useful work.
A single false-positive scale-up event for a 70B model on 4x A100 wastes approximately $2.40 at $3.00/GPU-hour. Ten such events per day compound to $24/day, or $720/month, per model deployment, a meaningful cost for a service running 10+ model deployments.
Scaling Signals
2.1 Signal Taxonomy
Autoscaling signals fall into three categories based on their temporal relationship to load:
- Reactive signals: measure current state. Queue depth, active request count, GPU memory utilization. By the time these trigger, load is already high, and cold start lag means the response is too late.
- Predictive signals: forecast future state. Request rate trend (linear regression over a 5-minute window), time-of-day patterns, scheduled batch job arrivals. These trigger before load arrives, giving cold start time to complete.
- Lagging signals: measure historical state. GPU utilization averaged over 5 minutes, throughput over the last hour. Useful for scale-down decisions (confirming load has truly decreased) but too slow for scale-up.
import collections
import math
import time as time_mod
class ScalingSignal:
"""Base class for scaling signals."""
def __init__(self, name, weight=1.0):
self.name = name
self.weight = weight
def compute(self):
"""
Returns a value in [0.0, 1.0] where:
- 0.0 = no scaling needed
- 0.5 = moderate pressure
- 1.0 = maximum pressure, scale up immediately
"""
raise NotImplementedError
class QueueDepthSignal(ScalingSignal):
"""
Reactive signal: current queue depth relative to capacity.
Fast to respond but already too late for cold starts.
"""
def __init__(self, queue, max_queue_depth=100, weight=1.0):
super().__init__("queue_depth", weight)
self.queue = queue
self.max_depth = max_queue_depth
def compute(self):
depth = len(self.queue)
if depth == 0:
return 0.0
return min(depth / self.max_depth, 1.0)
class RequestRateTrendSignal(ScalingSignal):
"""
Predictive signal: linear regression on request rate.
If rate is increasing, predicts future load before it arrives.
"""
def __init__(self, window_seconds=300, sample_interval=10,
weight=1.5):
super().__init__("request_rate_trend", weight)
self.window = window_seconds
self.interval = sample_interval
self.samples = collections.deque(
maxlen=window_seconds // sample_interval
)
self.request_count = 0
self.last_sample_time = time_mod.time()
def record_request(self):
self.request_count += 1
def sample(self):
"""Called periodically to record current rate."""
now = time_mod.time()
elapsed = now - self.last_sample_time
if elapsed > 0:
rate = self.request_count / elapsed
self.samples.append((now, rate))
self.request_count = 0
self.last_sample_time = now
def compute(self):
if len(self.samples) < 3:
return 0.0
# Linear regression: rate = a * t + b
times = [s[0] for s in self.samples]
rates = [s[1] for s in self.samples]
t_min = times[0]
xs = [t - t_min for t in times]
n = len(xs)
sum_x = sum(xs)
sum_y = sum(rates)
sum_xy = sum(x * y for x, y in zip(xs, rates))
sum_xx = sum(x * x for x in xs)
denom = n * sum_xx - sum_x * sum_x
if abs(denom) < 1e-10:
return 0.0
slope = (n * sum_xy - sum_x * sum_y) / denom
# Positive slope = increasing rate = need more capacity
if slope <= 0:
return 0.0
# Normalize: slope of 10 req/s per second is maximum pressure
max_slope = 10.0 # configurable
return min(slope / max_slope, 1.0)
def predicted_rate(self, horizon_seconds=60):
"""Predict request rate N seconds into the future."""
if len(self.samples) < 3:
if self.samples:
return self.samples[-1][1]
return 0.0
times = [s[0] for s in self.samples]
rates = [s[1] for s in self.samples]
t_min = times[0]
xs = [t - t_min for t in times]
n = len(xs)
sum_x = sum(xs)
sum_y = sum(rates)
sum_xy = sum(x * y for x, y in zip(xs, rates))
sum_xx = sum(x * x for x in xs)
denom = n * sum_xx - sum_x * sum_x
if abs(denom) < 1e-10:
return rates[-1]
slope = (n * sum_xy - sum_x * sum_y) / denom
intercept = (sum_y - slope * sum_x) / n
future_x = xs[-1] + horizon_seconds
predicted = slope * future_x + intercept
return max(predicted, 0.0)
class GPUUtilizationSignal(ScalingSignal):
"""
Lagging signal: GPU utilization averaged over a window.
Good for scale-down decisions, too slow for scale-up.
"""
def __init__(self, window_seconds=300, weight=0.5):
super().__init__("gpu_utilization", weight)
self.samples = collections.deque(
maxlen=window_seconds
)
def record(self, utilization):
"""Record GPU utilization sample (0.0 to 1.0)."""
self.samples.append(utilization)
def compute(self):
if not self.samples:
return 0.0
avg = sum(self.samples) / len(self.samples)
# Map 0-100% utilization to 0-1 signal
# Below 50%: no pressure; ramps linearly, saturating at 95%.
if avg < 0.5:
return 0.0
if avg > 0.95:
return 1.0
return (avg - 0.5) / 0.45
class TimeOfDaySignal(ScalingSignal):
"""
Predictive signal: known traffic patterns by time of day.
Uses historical data to predict load before it arrives.
"""
def __init__(self, hourly_pattern=None, weight=1.0):
super().__init__("time_of_day", weight)
# Default: typical API traffic pattern (normalized 0-1)
self.pattern = hourly_pattern or {
0: 0.2, 1: 0.15, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.15,
6: 0.3, 7: 0.5, 8: 0.7, 9: 0.85, 10: 0.9, 11: 0.95,
12: 0.85, 13: 0.9, 14: 0.95, 15: 0.9, 16: 0.85,
17: 0.8, 18: 0.7, 19: 0.6, 20: 0.5, 21: 0.4,
22: 0.35, 23: 0.25,
}
def compute(self):
import datetime
hour = datetime.datetime.now().hour
return self.pattern.get(hour, 0.5)
def lookahead(self, hours=1):
"""Return predicted load N hours from now."""
import datetime
future_hour = (datetime.datetime.now().hour + hours) % 24
return self.pattern.get(future_hour, 0.5)
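The trend extrapolation in RequestRateTrendSignal is ordinary least squares; here is a standalone check of the math with synthetic samples (the sample values are illustrative, not from the post):

```python
# Least-squares fit of rate = slope * t + intercept, then extrapolate.
samples = [(0, 10.0), (10, 14.0), (20, 18.0), (30, 22.0)]  # (t_seconds, req/s)
n = len(samples)
sum_x = sum(t for t, _ in samples)
sum_y = sum(r for _, r in samples)
sum_xy = sum(t * r for t, r in samples)
sum_xx = sum(t * t for t, _ in samples)
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x)
intercept = (sum_y - slope * sum_x) / n
predicted = slope * (30 + 60) + intercept  # rate 60s past the last sample
print(round(slope, 3), round(intercept, 3), round(predicted, 3))  # 0.4 10.0 46.0
```

A rate rising by 0.4 req/s every second is extrapolated to 46 req/s 60 seconds after the last sample, which is what the autoscaler provisions against.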
2.2 Composite Signal
No single signal is sufficient. The autoscaler must combine signals with appropriate weights.
class CompositeSignal:
"""
Combines multiple scaling signals into a single decision.
Uses weighted average with thresholds for scale-up/down.
"""
def __init__(self, signals, scale_up_threshold=0.6,
scale_down_threshold=0.2):
self.signals = signals
self.scale_up_threshold = scale_up_threshold
self.scale_down_threshold = scale_down_threshold
def evaluate(self):
"""
Compute composite scaling signal.
Returns (action, confidence, signal_values)
where action is 'scale_up', 'scale_down', or 'hold'.
"""
total_weight = sum(s.weight for s in self.signals)
weighted_sum = 0.0
signal_values = {}
for signal in self.signals:
value = signal.compute()
weighted_sum += value * signal.weight
signal_values[signal.name] = value
composite = weighted_sum / total_weight if total_weight > 0 else 0.0
if composite >= self.scale_up_threshold:
return 'scale_up', composite, signal_values
elif composite <= self.scale_down_threshold:
return 'scale_down', composite, signal_values
else:
return 'hold', composite, signal_values
def replicas_needed(self, current_replicas, max_replicas,
capacity_per_replica):
"""
Estimate the number of replicas needed based on
predicted load from all signals.
"""
# Use the request rate trend signal for capacity planning
rate_signal = None
for s in self.signals:
if s.name == "request_rate_trend":
rate_signal = s
break
if rate_signal is None:
return current_replicas
# Predict rate 2 minutes from now (covers cold start)
predicted_rate = rate_signal.predicted_rate(
horizon_seconds=120
)
if predicted_rate <= 0:
return max(1, current_replicas - 1)
# Add 20% headroom
needed = math.ceil(
(predicted_rate * 1.2) / capacity_per_replica
)
return min(max(1, needed), max_replicas)
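To see the weighted-average arithmetic concretely, here is a minimal sketch with stub signals (StubSignal and its values are illustrative stand-ins, not part of the autoscaler above):

```python
class StubSignal:
    """Minimal stand-in with the same compute()/weight interface."""
    def __init__(self, name, value, weight):
        self.name, self._value, self.weight = name, value, weight
    def compute(self):
        return self._value

signals = [
    StubSignal("queue_depth", 0.9, 1.0),         # reactive: queue nearly full
    StubSignal("request_rate_trend", 0.8, 1.5),  # predictive: rate climbing
    StubSignal("gpu_utilization", 0.4, 0.5),     # lagging: moderate utilization
]
total_weight = sum(s.weight for s in signals)  # 3.0
composite = sum(s.compute() * s.weight for s in signals) / total_weight
print(round(composite, 2))  # 0.77 -> above the 0.6 threshold, so scale_up
```

Note how the predictive signal's higher weight (1.5) lets a rising trend push the composite over the threshold even while the lagging utilization signal is still lukewarm.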
Signal Effectiveness for Scale-Up Decisions (% of spikes handled without SLO violation)
Warm Pools
3.1 Design
A warm pool maintains a set of pre-loaded replicas that are idle but ready to serve traffic immediately. When the autoscaler decides to scale up, it promotes a warm replica to active status in under 1 second (just flip the load balancer routing) instead of waiting 30-60 seconds for cold start.
import threading
import time as time_mod
class WarmPoolManager:
"""
Manages a pool of pre-loaded LLM replicas ready for
instant promotion to active serving.
"""
def __init__(self, min_warm=1, max_warm=3,
model_loader=None, cost_per_gpu_hour=3.0):
self.min_warm = min_warm
self.max_warm = max_warm
self.model_loader = model_loader
self.cost_per_gpu_hour = cost_per_gpu_hour
self.warm_replicas = [] # loaded, idle
self.active_replicas = [] # loaded, serving traffic
self.loading_replicas = [] # currently loading
self.lock = threading.Lock()
self.stats = {
'promotions': 0,
'demotions': 0,
'cold_starts_avoided': 0,
'idle_gpu_hours': 0.0,
}
def promote(self, count=1):
"""
Promote warm replicas to active.
Returns list of promoted replicas. Near-instant operation.
"""
promoted = []
with self.lock:
for _ in range(count):
if not self.warm_replicas:
break
replica = self.warm_replicas.pop(0)
replica['state'] = 'active'
replica['promoted_at'] = time_mod.time()
self.active_replicas.append(replica)
promoted.append(replica)
self.stats['promotions'] += 1
self.stats['cold_starts_avoided'] += 1
# Refill warm pool asynchronously
self._refill_warm_pool()
return promoted
def demote(self, count=1):
"""
Demote active replicas back to warm pool.
Called during scale-down.
"""
demoted = []
with self.lock:
for _ in range(count):
if not self.active_replicas:
break
if len(self.warm_replicas) >= self.max_warm:
# Warm pool full: actually terminate
replica = self.active_replicas.pop()
self._terminate_replica(replica)
else:
replica = self.active_replicas.pop()
replica['state'] = 'warm'
replica['demoted_at'] = time_mod.time()
self.warm_replicas.append(replica)
demoted.append(replica)
self.stats['demotions'] += 1
return demoted
def _refill_warm_pool(self):
"""Start loading new replicas to maintain warm pool size."""
with self.lock:
deficit = self.min_warm - len(self.warm_replicas) - len(
self.loading_replicas
)
if deficit <= 0:
return
for _ in range(deficit):
thread = threading.Thread(target=self._load_replica)
thread.daemon = True
thread.start()
def _load_replica(self):
"""Load a new replica (cold start -- runs in background)."""
replica = {
'id': f"replica-{time_mod.time_ns()}",
'state': 'loading',
'load_start': time_mod.time(),
}
with self.lock:
self.loading_replicas.append(replica)
# Simulate model loading (in production: actual model load)
if self.model_loader:
model = self.model_loader()
replica['model'] = model
replica['load_end'] = time_mod.time()
replica['load_time'] = replica['load_end'] - replica['load_start']
replica['state'] = 'warm'
with self.lock:
self.loading_replicas.remove(replica)
if len(self.warm_replicas) < self.max_warm:
self.warm_replicas.append(replica)
else:
self._terminate_replica(replica)
def _terminate_replica(self, replica):
"""Release GPU resources for a replica."""
replica['state'] = 'terminated'
# In production: release GPU, delete model, free memory
def idle_cost_per_hour(self):
"""Cost of maintaining warm replicas."""
num_warm = len(self.warm_replicas) + len(self.loading_replicas)
gpus_per_replica = 1 # adjust for TP
return num_warm * gpus_per_replica * self.cost_per_gpu_hour
def get_status(self):
with self.lock:
return {
'warm': len(self.warm_replicas),
'active': len(self.active_replicas),
'loading': len(self.loading_replicas),
'idle_cost_per_hour': self.idle_cost_per_hour(),
'stats': dict(self.stats),
}
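Stripped of locking and stats, the promotion path is just a list move plus a state flip (with the load balancer routing flipped alongside), which is why it completes in under a second; a toy sketch:

```python
# Minimal sketch of warm -> active promotion: no model loading involved.
warm = [{"id": "r1", "state": "warm"}, {"id": "r2", "state": "warm"}]
active = []

def promote_one():
    replica = warm.pop(0)          # take the oldest warm replica
    replica["state"] = "active"    # flip state; LB routing flips here too
    active.append(replica)
    return replica

promoted = promote_one()
print(promoted["id"], len(warm), len(active))  # r1 1 1
```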
3.2 Warm Pool Sizing
The optimal warm pool size balances cold start avoidance against idle GPU cost.
Warm Pool Size vs Cost and SLO Impact (70B model, 4x A100 per replica)
| Warm Pool Size | Idle Cost/hr | Idle Cost/month | Cold Starts Avoided | p99 TTFT Impact |
|---|---|---|---|---|
| 0 (no warm pool) | $0.00 | $0 | 0% | 2100ms (cold start dominates) |
| 1 replica | $12.00 | $8,640 | 85% | 380ms |
| 2 replicas | $24.00 | $17,280 | 96% | 210ms |
| 3 replicas | $36.00 | $25,920 | 99.2% | 195ms |
| 5 replicas | $60.00 | $43,200 | 99.9% | 190ms |
A warm pool of 2 replicas covers 96% of scale-up events for typical API traffic patterns. The marginal benefit of the 3rd replica is small (96% to 99.2%) but may be justified for premium SLO requirements. Beyond 3, the cost-benefit ratio degrades rapidly. For batch workloads with predictable schedules, a warm pool of 0-1 is sufficient because scaling can be triggered well in advance.
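The idle-cost column in the table above reduces to replicas × GPUs per replica × hourly rate; a quick check using the post's assumed $3.00/GPU-hour and 4-GPU (TP=4) replicas:

```python
# Idle cost of a warm pool: replicas x GPUs/replica x $/GPU-hour.
GPU_HOURLY = 3.00        # assumed A100 price used throughout the post
GPUS_PER_REPLICA = 4     # 70B model with TP=4
HOURS_PER_MONTH = 720

def warm_pool_cost(size):
    hourly = size * GPUS_PER_REPLICA * GPU_HOURLY
    return hourly, hourly * HOURS_PER_MONTH

print(warm_pool_cost(2))  # (24.0, 17280.0) -- matches the 2-replica row
```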
Reducing Cold Start: Fast Model Loading
4.1 ModelExpress Architecture
The most impactful optimization is reducing cold start itself. If model loading takes 200ms instead of 60 seconds, reactive scaling becomes viable and warm pools become optional.
Several techniques can compress the loading phase:
class FastModelLoader:
"""
Techniques for reducing LLM cold start time.
Goal: reduce from 30-60s to sub-second.
"""
def __init__(self, model_path, num_gpus=1):
self.model_path = model_path
self.num_gpus = num_gpus
def load_standard(self):
"""
Standard loading: read from disk, deserialize, move to GPU.
Slowest approach. 30-60s for 70B.
"""
import torch
start = time_mod.time()
# Read from disk into CPU memory
state_dict = torch.load(
self.model_path, map_location='cpu'
)
disk_time = time_mod.time() - start
# Move to GPU
gpu_start = time_mod.time()
for key in state_dict:
state_dict[key] = state_dict[key].cuda()
gpu_time = time_mod.time() - gpu_start
return {
'disk_time': disk_time,
'gpu_time': gpu_time,
'total': disk_time + gpu_time
}
def load_mmap(self):
"""
Memory-mapped loading: mmap the file, lazy-load pages
as they are accessed. Reduces initial load time because
only accessed pages are read from disk.
"""
import torch
import mmap
import numpy as np
start = time_mod.time()
# mmap the weight file
with open(self.model_path, 'rb') as f:
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
mmap_time = time_mod.time() - start # near-instant
# Weights are loaded on first access (page fault)
# GPU transfer still needed per-tensor
return {
'mmap_time': mmap_time,
'note': 'Pages loaded on demand during first inference'
}
def load_cuda_ipc(self):
"""
CUDA IPC: share GPU memory between processes.
If another process already has the model loaded,
get a handle to the same GPU memory.
Zero-copy, near-instant.
"""
import torch
import torch.multiprocessing as mp
start = time_mod.time()
# Receive IPC handles from the model server process
# Each tensor's GPU memory is shared, not copied
handles = self._get_ipc_handles()
tensors = {}
for name, handle in handles.items():
# Reconstruct tensor from IPC handle
tensors[name] = self._tensor_from_ipc(handle)
ipc_time = time_mod.time() - start
return {
'ipc_time': ipc_time,
'note': 'Zero-copy, shares existing GPU memory'
}
def load_from_host_cache(self):
"""
Host memory cache: keep model weights pinned in CPU memory.
New replicas copy from CPU cache to GPU.
Eliminates disk I/O; only PCIe transfer remains.
"""
import torch
start = time_mod.time()
# Weights already in pinned CPU memory (pre-cached)
# Just do async GPU copy
        for shard in range(self.num_gpus):
            device = torch.device(f'cuda:{shard}')
            shard_weights = self._get_cached_shard(shard)
            for name, tensor in shard_weights.items():
                # .to() returns a new tensor; keep the GPU-resident copy
                shard_weights[name] = tensor.to(device, non_blocking=True)
        torch.cuda.synchronize()
transfer_time = time_mod.time() - start
return {
'transfer_time': transfer_time,
'note': 'PCIe transfer only, no disk I/O'
}
def _get_ipc_handles(self):
"""Placeholder: get IPC handles from model server."""
return {}
def _tensor_from_ipc(self, handle):
"""Placeholder: reconstruct tensor from IPC handle."""
return None
def _get_cached_shard(self, shard_id):
"""Placeholder: get cached weight shard."""
return {}
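Using the same bandwidth assumptions as the cold start model (3 GB/s disk per GPU shard, 25 GB/s PCIe), the weight-movement portion of each strategy works out roughly as follows; this sketch ignores container startup, CUDA graphs, and warmup:

```python
# Weight-movement time for a 140 GB model on 4 GPUs, per loading strategy.
SIZE_GB, GPUS = 140, 4
DISK_GBPS, PCIE_GBPS = 3.0, 25.0

standard   = (SIZE_GB / GPUS) / DISK_GBPS + SIZE_GB / PCIE_GBPS  # disk + PCIe
host_cache = SIZE_GB / PCIE_GBPS                                 # PCIe only
cuda_ipc   = 0.0                      # handle open only, milliseconds in practice

print(round(standard, 1), round(host_cache, 1))  # 17.3 5.6
```

Host caching removes the disk read entirely; CUDA IPC removes the transfer as well, which is why it is the only strategy that gets cold start below one second.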
Cold Start Time by Loading Strategy (70B FP16, 4x A100) (seconds)
4.2 CUDA IPC Deep Dive
CUDA IPC (Inter-Process Communication) is the most promising approach for near-instant cold starts. The idea: one "model server" process holds the weights in GPU memory. New inference processes open IPC handles to the same GPU memory: zero copy, zero transfer.
import torch
import torch.cuda
class ModelServer:
"""
Persistent process that holds model weights in GPU memory
and shares them via CUDA IPC handles.
"""
def __init__(self, model_path, devices):
self.devices = devices
self.weights = {} # name -> tensor on GPU
self.ipc_handles = {} # name -> IPC handle
def load_and_share(self, model_path):
"""Load model and create IPC handles for all weight tensors."""
import torch
state_dict = torch.load(model_path)
for name, tensor in state_dict.items():
device_idx = self._shard_to_device(name)
gpu_tensor = tensor.to(f'cuda:{device_idx}')
self.weights[name] = gpu_tensor
# Create IPC handle that other processes can use
handle = gpu_tensor.storage()._share_cuda_()
self.ipc_handles[name] = {
'handle': handle,
'shape': list(tensor.shape),
'dtype': str(tensor.dtype),
'device': device_idx,
}
def get_handles(self):
"""Return IPC handles for a new inference process."""
return self.ipc_handles
def _shard_to_device(self, name):
"""Map weight name to GPU device for tensor parallelism."""
# Simple round-robin; production uses proper TP mapping
hash_val = hash(name) % len(self.devices)
return self.devices[hash_val]
class InferenceWorker:
"""
Inference process that uses CUDA IPC to access shared weights.
Cold start: receive handles + reconstruct tensors.
"""
def __init__(self, ipc_handles):
self.model_weights = {}
self._reconstruct_from_handles(ipc_handles)
def _reconstruct_from_handles(self, handles):
"""Reconstruct model weight tensors from IPC handles."""
import torch
for name, handle_info in handles.items():
device = torch.device(f"cuda:{handle_info['device']}")
dtype = getattr(torch, handle_info['dtype'].split('.')[-1])
            # Open shared GPU memory -- no data copy. Note that _share_cuda_
            # and _new_shared_cuda are private PyTorch internals whose exact
            # names and signatures vary across versions.
storage = torch.cuda.StorageBase._new_shared_cuda(
handle_info['handle']
)
tensor = torch.tensor([], dtype=dtype, device=device)
tensor = tensor.set_(storage).reshape(handle_info['shape'])
self.model_weights[name] = tensor
def run_inference(self, input_ids):
"""Run inference using shared weights."""
# Use self.model_weights as the model's state dict
# No copy was needed -- we're reading the same GPU memory
pass
CUDA IPC requires processes on the same machine sharing the same GPUs. It does not work across nodes. The model server process must remain alive: if it crashes, all workers lose access to the weights. Production deployments typically run the model server as a system daemon with automatic restart, and workers detect the crash and fall back to standard loading.
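The fallback path described above might look like this sketch, where get_ipc_handles and load_from_disk are hypothetical caller-supplied callables, not APIs from the post:

```python
# Try IPC attach first; fall back to a standard cold load if the
# model server is unreachable.
def acquire_weights(get_ipc_handles, load_from_disk):
    try:
        handles = get_ipc_handles()   # raises if the model server is down
        if handles:
            return "ipc", handles     # zero-copy path
    except (ConnectionError, OSError):
        pass
    return "cold", load_from_disk()   # standard 30-60s path

def failing_handles():
    raise ConnectionError("model server not running")

mode, weights = acquire_weights(failing_handles, lambda: {"w0": "loaded"})
print(mode)  # cold
```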
Scale-Down Policy
5.1 Hysteresis: Preventing Oscillation
The most common autoscaling failure mode is oscillation: scale up on a spike, scale down when it subsides, scale up again on the next spike 2 minutes later. Each cycle wastes a cold start's worth of GPU-hours.
class ScaleDownPolicy:
"""
Hysteresis-based scale-down policy.
Prevents oscillation by requiring sustained low load
before removing replicas.
"""
def __init__(self, cooldown_seconds=300,
sustained_low_seconds=180,
min_replicas=1,
scale_down_rate=1):
# Cooldown after scale-up before any scale-down
self.cooldown_seconds = cooldown_seconds
# How long load must be low before scale-down triggers
self.sustained_low_seconds = sustained_low_seconds
self.min_replicas = min_replicas
self.scale_down_rate = scale_down_rate # max replicas per step
self.last_scale_up_time = 0
self.low_load_start_time = None
def should_scale_down(self, current_replicas, composite_signal,
signal_value):
"""
Decide whether to scale down.
Returns number of replicas to remove (0 = no action).
"""
now = time_mod.time()
# Respect cooldown after last scale-up
if now - self.last_scale_up_time < self.cooldown_seconds:
return 0
# Cannot go below minimum
if current_replicas <= self.min_replicas:
return 0
# Check if signal indicates low load
if signal_value > 0.3:
# Load is not low -- reset timer
self.low_load_start_time = None
return 0
# Load is low -- start or continue timer
if self.low_load_start_time is None:
self.low_load_start_time = now
return 0
# Check if low load has been sustained
low_duration = now - self.low_load_start_time
if low_duration < self.sustained_low_seconds:
return 0
# Sustained low load: scale down
excess = current_replicas - self.min_replicas
to_remove = min(excess, self.scale_down_rate)
self.low_load_start_time = None # reset timer
return to_remove
def record_scale_up(self):
"""Record that a scale-up just happened."""
self.last_scale_up_time = time_mod.time()
self.low_load_start_time = None
5.2 Cost-Optimal Scale-Down
Scale-down decisions must account for the cost of future scale-ups. Removing a replica saves its hourly idle cost, but if load spikes again, the cold start costs wasted time and SLO violations. The optimal policy keeps a replica when its expected utilization over the near-term horizon exceeds a threshold.
class CostOptimalScaler:
"""
Makes scale-down decisions based on cost optimization.
Compares the cost of keeping an idle replica vs the expected
cost of a future cold start.
"""
def __init__(self, cost_per_gpu_hour=3.0, gpus_per_replica=4,
cold_start_seconds=40, spike_probability_per_hour=2.0,
slo_violation_cost=5.0):
self.gpu_cost = cost_per_gpu_hour
self.gpus = gpus_per_replica
self.cold_start_s = cold_start_seconds
self.spike_prob = spike_probability_per_hour # spikes/hour
self.slo_cost = slo_violation_cost # $ per violation
def keep_or_remove(self, idle_minutes, predicted_load_next_hour):
"""
Decide whether to keep an idle replica.
Returns ('keep', reason) or ('remove', reason).
"""
# Cost of keeping for the next hour
keep_cost = self.gpu_cost * self.gpus # $ per hour
# Expected cost of NOT having the replica when needed
# = P(spike in next hour) * cost_of_cold_start_during_spike
cold_start_cost_per_spike = (
# Wasted GPU-hours during cold start
(self.cold_start_s / 3600) * self.gpu_cost * self.gpus +
# SLO violations during cold start
self.slo_cost * (self.cold_start_s / 60)
)
# Adjust spike probability based on predicted load
adjusted_spike_prob = self.spike_prob * predicted_load_next_hour
remove_expected_cost = (
adjusted_spike_prob * cold_start_cost_per_spike
)
if keep_cost < remove_expected_cost:
return 'keep', (
f"Keep cost ${keep_cost:.2f}/hr less than "
f"expected cold start cost ${remove_expected_cost:.2f}/hr"
)
else:
return 'remove', (
f"Keep cost ${keep_cost:.2f}/hr exceeds "
f"expected cold start cost ${remove_expected_cost:.2f}/hr"
)
def optimal_warm_pool_size(self, hourly_spike_rate,
max_pool_size=5):
"""
Compute cost-optimal warm pool size.
Each warm replica eliminates one cold start per spike.
"""
best_size = 0
best_total_cost = float('inf')
for size in range(max_pool_size + 1):
# Cost of maintaining warm pool
warm_cost = size * self.gpu_cost * self.gpus # per hour
# Expected cold starts per hour with this pool size
# If pool_size >= spike_magnitude, no cold starts
# Simplified: assume each spike needs 1 replica
expected_cold_starts = max(0, hourly_spike_rate - size)
cold_start_cost = (
expected_cold_starts *
(self.cold_start_s / 3600) *
self.gpu_cost * self.gpus
)
slo_cost = expected_cold_starts * self.slo_cost
total = warm_cost + cold_start_cost + slo_cost
if total < best_total_cost:
best_total_cost = total
best_size = size
return best_size, best_total_cost
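Plugging the constructor defaults into keep_or_remove shows why this policy usually removes: the hourly keep cost is $12.00, while the expected cold start cost stays well below that unless spikes are very frequent. A standalone recomputation:

```python
# Recompute keep_or_remove's comparison using the class defaults.
GPU_HOURLY, GPUS = 3.0, 4
COLD_START_S, SPIKES_PER_HOUR, SLO_COST = 40, 2.0, 5.0

keep_cost = GPU_HOURLY * GPUS                    # $12.00/hr to hold the replica
per_spike = ((COLD_START_S / 3600) * GPU_HOURLY * GPUS   # wasted GPU time
             + SLO_COST * (COLD_START_S / 60))           # SLO violations
remove_cost = SPIKES_PER_HOUR * 0.9 * per_spike  # predicted load = 0.9
print(round(per_spike, 2), round(remove_cost, 2))  # 3.47 6.24
# remove_cost < keep_cost, so with default parameters the replica is removed.
```

Only when the SLO violation cost or spike rate is substantially higher does the balance tip toward keeping idle capacity, which is exactly the premium-SLO scenario the warm pool section described.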
Complete Autoscaler
6.1 Integration
import time as time_mod
import threading
import logging
logger = logging.getLogger(__name__)
class LLMAutoscaler:
"""
Production autoscaler for LLM inference.
Integrates: composite signals, warm pool, fast loading,
hysteresis, and cost optimization.
"""
def __init__(self, config):
self.config = config
# Scaling signals
self.queue = [] # shared reference to request queue
self.signals = CompositeSignal(
signals=[
QueueDepthSignal(
self.queue,
max_queue_depth=config.get('max_queue', 100),
weight=1.0
),
RequestRateTrendSignal(
window_seconds=300,
weight=1.5
),
GPUUtilizationSignal(
window_seconds=300,
weight=0.5
),
TimeOfDaySignal(weight=0.8),
],
scale_up_threshold=config.get('scale_up_threshold', 0.6),
scale_down_threshold=config.get('scale_down_threshold', 0.2),
)
# Warm pool
self.warm_pool = WarmPoolManager(
min_warm=config.get('min_warm_replicas', 2),
max_warm=config.get('max_warm_replicas', 3),
cost_per_gpu_hour=config.get('cost_per_gpu_hour', 3.0),
)
# Scale-down policy
self.scale_down = ScaleDownPolicy(
cooldown_seconds=config.get('scale_down_cooldown', 300),
sustained_low_seconds=config.get('sustained_low', 180),
min_replicas=config.get('min_replicas', 1),
)
# Cost optimizer
self.cost_optimizer = CostOptimalScaler(
cost_per_gpu_hour=config.get('cost_per_gpu_hour', 3.0),
gpus_per_replica=config.get('gpus_per_replica', 4),
cold_start_seconds=config.get('cold_start_seconds', 40),
)
# State
self.current_replicas = config.get('initial_replicas', 2)
self.max_replicas = config.get('max_replicas', 10)
self.running = False
self.scaling_history = []
def start(self):
"""Start the autoscaler loop in a background thread."""
self.running = True
self.warm_pool._refill_warm_pool()
thread = threading.Thread(target=self._scaling_loop, daemon=True)
thread.start()
logger.info(
"Autoscaler started. Replicas: %d, Warm: %d",
self.current_replicas, self.warm_pool.min_warm
)
def stop(self):
self.running = False
def _scaling_loop(self):
"""Main scaling loop. Runs every evaluation_interval."""
interval = self.config.get('evaluation_interval', 15)
while self.running:
try:
self._evaluate_and_act()
except Exception as e:
logger.error("Autoscaler error: %s", e)
time_mod.sleep(interval)
def _evaluate_and_act(self):
"""Single evaluation cycle."""
# Sample request rate signal
for signal in self.signals.signals:
if hasattr(signal, 'sample'):
signal.sample()
action, confidence, signal_values = self.signals.evaluate()
logger.debug(
"Signals: %s -> action=%s confidence=%.2f",
signal_values, action, confidence
)
if action == 'scale_up':
self._handle_scale_up(confidence, signal_values)
elif action == 'scale_down':
self._handle_scale_down(confidence, signal_values)
def _handle_scale_up(self, confidence, signal_values):
"""Handle scale-up decision."""
if self.current_replicas >= self.max_replicas:
logger.warning("At max replicas (%d), cannot scale up",
self.max_replicas)
return
# Determine how many replicas to add
needed = self.signals.replicas_needed(
self.current_replicas,
self.max_replicas,
self.config.get('capacity_per_replica', 50)
)
to_add = needed - self.current_replicas
if to_add <= 0:
return
# Try warm pool first (instant)
warm_status = self.warm_pool.get_status()
from_warm = min(to_add, warm_status['warm'])
if from_warm > 0:
promoted = self.warm_pool.promote(from_warm)
self.current_replicas += len(promoted)
logger.info(
"Promoted %d warm replicas. Active: %d",
len(promoted), self.current_replicas
)
# Remaining need cold start
remaining = to_add - from_warm
if remaining > 0:
self._cold_start_replicas(remaining)
self.scale_down.record_scale_up()
self.scaling_history.append({
'time': time_mod.time(),
'action': 'scale_up',
'from_warm': from_warm,
'cold_start': remaining,
'total_replicas': self.current_replicas,
'confidence': confidence,
'signals': signal_values,
})
def _handle_scale_down(self, confidence, signal_values):
"""Handle scale-down decision."""
to_remove = self.scale_down.should_scale_down(
self.current_replicas,
self.signals,
            confidence  # composite value; already near zero when load is low, so no inversion
)
if to_remove == 0:
return
# Cost check: should we keep for future spikes?
tod_signal = None
for s in self.signals.signals:
if s.name == "time_of_day":
tod_signal = s
break
predicted_load = tod_signal.lookahead(1) if tod_signal else 0.5
for _ in range(to_remove):
decision, reason = self.cost_optimizer.keep_or_remove(
idle_minutes=5,
predicted_load_next_hour=predicted_load
)
if decision == 'keep':
logger.info("Cost optimizer: keeping replica. %s",
reason)
break
# Demote to warm pool (or terminate if pool is full)
demoted = self.warm_pool.demote(1)
self.current_replicas -= 1
logger.info(
"Scaled down. Active: %d, Demoted to warm: %d",
self.current_replicas, len(demoted)
)
self.scaling_history.append({
'time': time_mod.time(),
'action': 'scale_down',
'removed': to_remove,
'total_replicas': self.current_replicas,
'confidence': confidence,
'signals': signal_values,
})
def _cold_start_replicas(self, count):
"""Start cold-loading replicas (async)."""
for _ in range(count):
self.current_replicas += 1
# In production: launch container, load model
logger.info(
"Cold-starting replica. Active: %d (loading...)",
self.current_replicas
)
def get_status(self):
"""Return full autoscaler status."""
_, confidence, signal_values = self.signals.evaluate()
return {
'active_replicas': self.current_replicas,
'max_replicas': self.max_replicas,
'warm_pool': self.warm_pool.get_status(),
'composite_signal': confidence,
'signal_values': signal_values,
'scaling_history_last_10': self.scaling_history[-10:],
'cost': {
'active_per_hour': (
self.current_replicas *
self.config.get('gpus_per_replica', 4) *
self.config.get('cost_per_gpu_hour', 3.0)
),
'warm_per_hour': self.warm_pool.idle_cost_per_hour(),
},
}
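As a sanity check on the `cost` section of `get_status()`, the arithmetic for an example deployment works out as follows. All values here are assumptions chosen for illustration, not figures from the benchmark above.

```python
# Hypothetical deployment: 4 active replicas, 4 GPUs each, $3/GPU-hour,
# mirroring the 'active_per_hour' expression in get_status().
active_replicas = 4
gpus_per_replica = 4
cost_per_gpu_hour = 3.0

active_per_hour = active_replicas * gpus_per_replica * cost_per_gpu_hour
active_per_day = active_per_hour * 24

print(active_per_hour)  # 48.0  ($/hour for active capacity)
print(active_per_day)   # 1152.0 ($/day, before warm pool idle cost)
```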
6.2 Autoscaler Performance Comparison
Autoscaler Configurations Under Realistic Traffic (24hr simulation)
| Configuration | Avg Replicas | GPU Cost/day | p99 TTFT | SLO Compliance | Cold Starts/day |
|---|---|---|---|---|---|
| Fixed (peak provisioned) | 8.0 | $576 | 180ms | 99.9% | 0 |
| Reactive (queue depth) | 4.2 | $302 | 3200ms | 71% | 38 |
| Predictive (rate trend) | 4.8 | $346 | 820ms | 89% | 14 |
| Composite signals | 4.5 | $324 | 580ms | 92% | 9 |
| Composite + warm pool (2) | 4.5 + 2 warm | $420 | 210ms | 98.5% | 2 |
| Composite + warm + cost opt | 4.3 + 1.5 warm | $384 | 240ms | 97.8% | 3 |
Compared to fixed peak provisioning at $576/day, the full configuration (composite signals, warm pool, and cost optimizer) saves $192/day (33%) while maintaining 97.8% SLO compliance. The warm pool adds $96/day in idle cost but eliminates 36 of the 38 daily cold starts, each of which would cause SLO violations affecting hundreds of requests. The net value of the warm pool is strongly positive whenever SLO violations have business impact.
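The warm pool trade-off in the last two rows can be checked with a back-of-envelope calculation. The idle cost and cold start deltas come from the table; the per-violation cost and the number of requests affected per cold start are hypothetical placeholders you would replace with your own business numbers.

```python
# Warm pool ROI sketch. Table deltas are real; the last two inputs
# (blast radius and $ per violated request) are assumptions.
warm_pool_idle_per_day = 420.0 - 324.0   # $96/day (table delta)
cold_starts_avoided = 38 - 2             # 36/day (table delta)
requests_per_cold_start = 300            # hypothetical blast radius
cost_per_violation = 0.05                # hypothetical $ per violated request

avoided_violation_cost = (cold_starts_avoided *
                          requests_per_cold_start *
                          cost_per_violation)
net_value = avoided_violation_cost - warm_pool_idle_per_day
print(net_value)  # 444.0 -> the warm pool pays for itself ~5.6x over
```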
Scaling Signal Calibration
7.1 Threshold Tuning
The composite signal thresholds (scale-up at 0.6, scale-down at 0.2) are not universal. They depend on traffic characteristics, cold start duration, and SLO requirements. A systematic approach is to simulate the autoscaler against historical traffic and optimize thresholds.
class ThresholdOptimizer:
"""
Optimize autoscaler thresholds by replaying historical traffic.
Minimizes: cost + SLO_violation_penalty.
"""
def __init__(self, traffic_trace, cold_start_s, gpus_per_replica,
cost_per_gpu_hour, slo_ttft_ms, violation_penalty):
self.trace = traffic_trace # list of (timestamp, request_rate)
self.cold_start = cold_start_s
self.gpus = gpus_per_replica
self.gpu_cost = cost_per_gpu_hour
self.slo = slo_ttft_ms
self.penalty = violation_penalty
def simulate(self, scale_up_thresh, scale_down_thresh,
warm_pool_size):
"""
Simulate autoscaler with given thresholds.
Returns (total_cost, slo_compliance, avg_replicas).
"""
replicas = 1
warm = warm_pool_size
total_cost = 0.0
violations = 0
total_requests = 0
last_scale_up = 0
for i, (ts, rate) in enumerate(self.trace):
            # Simple rate-based load signal; assume each replica
            # serves ~50 requests/s at capacity
            capacity = replicas * 50
            load_ratio = rate / capacity if capacity > 0 else 1.0
if load_ratio > scale_up_thresh:
if warm > 0:
replicas += 1
warm -= 1
# Instant: no cold start
else:
# Cold start: capacity gap during loading
violations += int(
rate * self.cold_start *
max(0, load_ratio - 1.0)
)
replicas += 1
last_scale_up = ts
if (load_ratio < scale_down_thresh and
replicas > 1 and
ts - last_scale_up > 300):
replicas -= 1
if warm < warm_pool_size:
warm += 1
            # Cost accumulation (assumes the trace is sampled at
            # 1-second intervals, one sample per loop iteration)
            per_second_cost = (
                (replicas + warm) * self.gpus *
                self.gpu_cost / 3600
            )
            total_cost += per_second_cost
total_requests += int(rate)
        slo_compliance = 1.0 - (violations / max(total_requests, 1))
        # Average provisioned replicas (active + warm) over the trace.
        # At this point total_cost is pure GPU-time cost, so dividing by
        # the per-replica, per-sample cost recovers the mean count.
        avg_replicas = total_cost / (
            self.gpus * self.gpu_cost / 3600 * len(self.trace)
        )
        total_cost += violations * self.penalty
        return total_cost, slo_compliance, avg_replicas
def optimize(self, warm_pool_sizes=None):
"""
Grid search over thresholds and warm pool sizes.
Returns best configuration.
"""
if warm_pool_sizes is None:
warm_pool_sizes = [0, 1, 2, 3]
best_config = None
best_cost = float('inf')
for up_thresh in [0.4, 0.5, 0.6, 0.7, 0.8]:
for down_thresh in [0.1, 0.15, 0.2, 0.25, 0.3]:
if down_thresh >= up_thresh:
continue
for warm in warm_pool_sizes:
cost, compliance, avg_rep = self.simulate(
up_thresh, down_thresh, warm
)
if cost < best_cost:
best_cost = cost
best_config = {
'scale_up_threshold': up_thresh,
'scale_down_threshold': down_thresh,
'warm_pool_size': warm,
'total_cost': cost,
'slo_compliance': compliance,
'avg_replicas': avg_rep,
}
return best_config
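A minimal way to exercise `ThresholdOptimizer` is to replay a synthetic diurnal trace. The generator below is a sketch (the rates and sinusoidal shape are assumptions); note that `simulate()` accumulates cost once per trace sample, so the trace should be sampled at 1-second resolution.

```python
import math

def synthetic_trace(hours=24, base_rate=80.0, peak_rate=240.0,
                    peak_hour=14):
    """One sample per second: sinusoidal daily load peaking
    mid-afternoon. Returns a list of (timestamp_s, requests_per_s)."""
    trace = []
    for ts in range(hours * 3600):
        hour = (ts / 3600.0) % 24
        # Cosine bump centered on peak_hour, scaled into [base, peak]
        phase = math.cos((hour - peak_hour) / 24 * 2 * math.pi)
        rate = base_rate + (peak_rate - base_rate) * (phase + 1) / 2
        trace.append((ts, rate))
    return trace

trace = synthetic_trace()

# With ThresholdOptimizer from above (penalty value is an assumption):
# best = ThresholdOptimizer(
#     trace, cold_start_s=60, gpus_per_replica=1,
#     cost_per_gpu_hour=3.0, slo_ttft_ms=500,
#     violation_penalty=0.05,
# ).optimize()
```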
7.2 Practical Configuration Guidelines
Recommended Autoscaler Configuration by Deployment Profile
| Profile | Scale-Up Threshold | Scale-Down Threshold | Warm Pool | Cooldown | Sustained Low |
|---|---|---|---|---|---|
| Cost-sensitive API | 0.7 | 0.15 | 1 | 300s | 300s |
| Balanced API | 0.6 | 0.2 | 2 | 300s | 180s |
| Latency-critical API | 0.5 | 0.25 | 3 | 600s | 300s |
| Batch processing | 0.8 | 0.1 | 0 | 120s | 60s |
| Multi-tenant platform | 0.55 | 0.2 | 2 | 300s | 240s |
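The table rows translate directly into configuration dictionaries. The key names below are assumptions chosen to match the knobs discussed in this post, not the schema of any particular framework.

```python
# Deployment profiles from the table above. Key names are hypothetical.
PROFILES = {
    'cost_sensitive_api': {'scale_up_threshold': 0.70, 'scale_down_threshold': 0.15,
                           'warm_pool_size': 1, 'cooldown_s': 300, 'sustained_low_s': 300},
    'balanced_api':       {'scale_up_threshold': 0.60, 'scale_down_threshold': 0.20,
                           'warm_pool_size': 2, 'cooldown_s': 300, 'sustained_low_s': 180},
    'latency_critical':   {'scale_up_threshold': 0.50, 'scale_down_threshold': 0.25,
                           'warm_pool_size': 3, 'cooldown_s': 600, 'sustained_low_s': 300},
    'batch_processing':   {'scale_up_threshold': 0.80, 'scale_down_threshold': 0.10,
                           'warm_pool_size': 0, 'cooldown_s': 120, 'sustained_low_s': 60},
    'multi_tenant':       {'scale_up_threshold': 0.55, 'scale_down_threshold': 0.20,
                           'warm_pool_size': 2, 'cooldown_s': 300, 'sustained_low_s': 240},
}

# Sanity check: every profile keeps a gap between the two thresholds,
# which the grid search in ThresholdOptimizer.optimize() also enforces.
for name, p in PROFILES.items():
    assert p['scale_up_threshold'] > p['scale_down_threshold'], name
```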