Your inference service crashed and auto-scaled a fresh GPU. The first request arrives 200 milliseconds later, but your model takes 20 seconds to load from NVMe. That request — and every other one queued behind it — times out. Cold start is not an initialization detail; it’s user-facing latency that determines whether your service works at all under burst traffic or pod restarts. A 70B model occupies 140 GB in FP16. Every byte must travel from NVMe through host RAM into GPU HBM before the first token can be generated, and the naive approach takes 20-44 seconds depending on which bandwidth bottleneck dominates.
The Loading Pipeline
Model loading is a four-stage pipeline:
- Deserialize - Parse the checkpoint format, locate tensor offsets
- Read - Transfer bytes from storage to host RAM
- Reconstruct - Build PyTorch tensors from raw bytes
- Upload - Copy tensors from host RAM to GPU HBM via PCIe/NVLink
Each stage has a different bottleneck. The total cold start time is dominated by whichever stage is slowest, but pipelining can overlap them.
```python
import time
import torch
from safetensors import safe_open

def measure_loading_stages(model_path, device="cuda:0"):
    """Measure each stage of model loading independently."""
    timings = {}

    # Stage 1: Deserialize (parse metadata)
    # For safetensors, this reads only the 8-byte length prefix + JSON header
    t0 = time.perf_counter()
    f = safe_open(model_path, framework="pt", device="cpu")
    tensor_names = f.keys()
    timings["deserialize_ms"] = (time.perf_counter() - t0) * 1000

    # Stage 2+3: Read + Reconstruct (get tensors on CPU)
    t0 = time.perf_counter()
    cpu_tensors = {name: f.get_tensor(name) for name in tensor_names}
    timings["read_reconstruct_ms"] = (time.perf_counter() - t0) * 1000

    # Stage 4: Upload to GPU
    t0 = time.perf_counter()
    gpu_tensors = {name: t.to(device, non_blocking=True)
                   for name, t in cpu_tensors.items()}
    torch.cuda.synchronize(device)
    timings["upload_ms"] = (time.perf_counter() - t0) * 1000

    timings["total_gb"] = sum(t.nbytes for t in cpu_tensors.values()) / (1024**3)
    timings["total_ms"] = sum(v for k, v in timings.items() if k.endswith("_ms"))
    return timings
```
Model Loading Time Breakdown (70B FP16, Single File)
| Stage | Time (s) | Bottleneck | Bandwidth Utilization |
|---|---|---|---|
| Deserialize (pickle) | 2.1 | CPU single-thread | N/A |
| Deserialize (safetensors) | 0.002 | CPU (header only) | N/A |
| Read (read syscall) | 22.4 | NVMe seq read 6.2 GB/s | 88% |
| Read (mmap + fault) | 18.7 | Page faults + readahead | 74% |
| Reconstruct (pickle) | 8.3 | CPU unpickle + malloc | N/A |
| Reconstruct (safetensors) | 0.0 | Zero-copy | N/A |
| Upload (PCIe 4.0 x16) | 8.75 | PCIe 4.0: 16 GB/s per dir | 100% |
| Upload (PCIe 5.0 x16) | 4.38 | PCIe 5.0: 32 GB/s per dir | 100% |
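The upload rows are pure arithmetic: model size divided by per-direction link bandwidth. A quick sanity check on the table's numbers:

```python
MODEL_GB = 140  # 70B parameters x 2 bytes (FP16)

# Upload time = model size / PCIe bandwidth per direction
for link, gb_per_s in [("PCIe 4.0 x16", 16.0), ("PCIe 5.0 x16", 32.0)]:
    print(f"{link}: {MODEL_GB / gb_per_s:.2f} s")
# PCIe 4.0 x16: 8.75 s
# PCIe 5.0 x16: 4.38 s
```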
PyTorch Pickle vs safetensors: The Deserialization Gap
Traditional PyTorch checkpoints use Python’s pickle module. Loading a pickle file means:
- Read the entire file into memory
- Execute the pickle bytecode (this is a full Python interpreter loop)
- Allocate fresh tensors and copy data into them
- Rebuild the `state_dict` structure
This is serial, CPU-bound, and inherently unsafe (pickle can execute arbitrary Python). For a 140 GB checkpoint split across 30 shards, the pickle overhead alone can cost 5-10 seconds.
```python
# Traditional PyTorch loading - what happens internally
import pickle

def _legacy_load_checkpoint(path):
    """Simplified view of torch.load internals (legacy format)."""
    with open(path, "rb") as f:
        # Read magic number, protocol version
        magic = pickle.load(f)  # Executes pickle bytecode
        # Main unpickle - CPU bound; each storage object
        # allocates memory and copies data from the file
        result = pickle.load(f)
    return result
```
```python
# safetensors loading - what happens internally
from safetensors import safe_open

def _safetensors_load(path):
    """safetensors reads an 8-byte header length, then the JSON header.
    Tensor data is at known offsets - no parsing needed."""
    f = safe_open(path, framework="pt", device="cpu")
    # Header parsing: read 8 bytes (u64 header_size),
    # then read header_size bytes of JSON.
    # JSON maps tensor_name -> {dtype, shape, data_offsets: [start, end]}
    # Getting a tensor = mmap the file + return a view at the offset:
    # zero copy, zero pickle, zero arbitrary code execution
    return {name: f.get_tensor(name) for name in f.keys()}
```
The safetensors format stores a fixed-size header (8-byte length prefix + JSON metadata) followed by raw tensor data at byte-aligned offsets. To load tensor "model.layers.0.self_attn.q_proj.weight", the library reads the header once, looks up the offset range [start, end], and returns a tensor view pointing directly at the mmapped file region.
safetensors Header Structure
```
+------------------+------------------------------------+
| 8 bytes (u64 LE) | JSON header (header_size bytes)    |
| header_size      | {"tensor_name": {                  |
|                  |     "dtype": "F16",                |
|                  |     "shape": [8192, 8192],         |
|                  |     "data_offsets": [0, 134217728] |
|                  | }, ...}                            |
+------------------+------------------------------------+
| Tensor data region (contiguous, byte-aligned)         |
| [tensor_0 bytes][padding][tensor_1 bytes][padding]... |
+-------------------------------------------------------+
```
safetensors deserialization is an O(1) per-tensor lookup after the header is parsed. The header for a 70B model is approximately 200 KB of JSON. Compare this to pickle, which must execute O(N) bytecode instructions, where N is the total number of storage objects (thousands for a large model).
mmap vs read: Two Approaches to Getting Data into RAM
The read syscall and mmap represent fundamentally different strategies for getting file data into host memory.
read Syscall
```python
import os

def load_with_read(path, buffer_size=128 * 1024 * 1024):
    """Explicit read: data flows kernel buffer -> user buffer.
    Note: real O_DIRECT requires block-aligned buffers, offsets, and
    sizes; os.pread does not guarantee alignment, so treat the flag
    here as illustrative."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)  # O_DIRECT bypasses page cache
    file_size = os.fstat(fd).st_size
    buf = bytearray(file_size)
    offset = 0
    while offset < file_size:
        chunk = os.pread(fd, min(buffer_size, file_size - offset), offset)
        buf[offset:offset + len(chunk)] = chunk
        offset += len(chunk)
    os.close(fd)
    return buf
```
With read:
- The kernel copies data from the page cache (or directly from disk with `O_DIRECT`) into user-space memory
- You control the I/O pattern: sequential reads, specific buffer sizes, prefetching hints
- Memory is allocated eagerly - `read` blocks until data is in the buffer
- The data is in your address space and can be freed explicitly
mmap
```python
import mmap
import os

def load_with_mmap(path):
    """mmap: OS maps file pages directly into virtual address space.
    No data is read until pages are accessed (demand paging)."""
    fd = os.open(path, os.O_RDONLY)
    file_size = os.fstat(fd).st_size
    # Map the entire file - this returns immediately (no I/O yet)
    mm = mmap.mmap(fd, file_size, access=mmap.ACCESS_READ)
    # Advise the kernel about our access pattern (Python 3.8+)
    mm.madvise(mmap.MADV_SEQUENTIAL)  # We will read sequentially
    mm.madvise(mmap.MADV_WILLNEED)    # Prefetch aggressively
    os.close(fd)  # fd can be closed - the mapping persists
    return mm
```
With mmap:
- The kernel maps file pages directly into your virtual address space
- No data is read until you touch a page (demand paging)
- Subsequent accesses to the same page hit the page cache (no syscall)
- The kernel manages eviction - pages can be dropped and re-read transparently
- No explicit memory allocation for the file data
When mmap Wins
mmap wins for model loading when:
- Multiple processes load the same model - the page cache is shared. Two vLLM workers loading Llama-70B share the same physical pages.
- You only need a subset of tensors - demand paging means untouched tensors never leave disk.
- Memory pressure is high - the kernel can evict model pages without your process being aware (they will be re-faulted if needed).
When read Wins
read wins when:
- You need all the data and want predictable latency - read gives you explicit control over I/O scheduling. mmap’s page faults cause unpredictable stalls.
- O_DIRECT is available - bypasses the page cache entirely, reducing memory pressure and avoiding double-buffering.
- NVMe parallelism - with `io_uring` or `aio`, you can issue multiple concurrent reads to saturate the NVMe queue depth
```python
# Linux io_uring for parallel NVMe reads
# This saturates the NVMe command queue (typical depth: 128-1024)

def parallel_nvme_read(path, offsets_and_sizes):
    """Issue parallel reads via io_uring to saturate NVMe bandwidth.
    Each (offset, size) pair becomes one SQE in the submission queue.

    In practice, use liburing bindings or a Python io_uring wrapper.
    Simplified pseudocode:

        ring = io_uring_setup(queue_depth=256)
        for offset, size in offsets_and_sizes:
            sqe = ring.get_sqe()
            io_uring_prep_read(sqe, fd, buf_at_offset, size, offset)
        ring.submit()                            # Submit all reads at once
        ring.wait(count=len(offsets_and_sizes))  # Wait for all completions
    """
```
mmap vs read vs io_uring: 140 GB Model Load (NVMe Gen4)
| Method | Time (s) | Peak RSS (GB) | NVMe Queue Depth | Bandwidth (GB/s) |
|---|---|---|---|---|
| read (sequential) | 22.4 | 140 | 1 | 6.2 |
| read (O_DIRECT, 128MB buf) | 20.1 | 0.128 | 1 | 6.9 |
| mmap (MADV_SEQUENTIAL) | 18.7 | 140 (shared) | Kernel-managed | 7.4 |
| io_uring (QD=128) | 14.8 | 140 | 128 | 9.4 |
| io_uring (QD=128, O_DIRECT) | 13.2 | 0.128 | 128 | 10.6 |
safetensors Zero-Copy Loading
The key insight in safetensors is that tensor data on disk is already in the correct binary format. A float16 tensor stored in safetensors is just raw FP16 bytes in row-major order. No endian conversion, no decompression, no format transformation. This means the path from disk to GPU can be:
NVMe -> Page Cache -> mmap view -> cudaMemcpy -> GPU HBM
With zero CPU-side copies or transformations.
```python
from safetensors import safe_open
import torch

def zero_copy_load_to_gpu(safetensors_path, device="cuda:0"):
    """Load tensors directly from safetensors to GPU.
    The 'device' parameter in safe_open controls where tensors land."""
    # When device="cuda:0", safetensors takes the following path:
    #   1. mmap the file
    #   2. For each tensor, get the mmap view (pointer + offset)
    #   3. cudaMemcpy(gpu_ptr, mmap_ptr + offset, size, HostToDevice)
    # No intermediate CPU tensor is ever allocated.
    f = safe_open(safetensors_path, framework="pt", device=device)
    state_dict = {}
    for name in f.keys():
        state_dict[name] = f.get_tensor(name)
        # Tensor is now on GPU. The mmap pages that backed it
        # can be evicted from the page cache immediately.
    return state_dict

def verify_zero_copy(safetensors_path):
    """Rough check that safetensors avoids CPU tensor allocation.
    tracemalloc only sees Python-level allocations, so this is an
    approximation rather than a byte-exact accounting."""
    import tracemalloc

    tracemalloc.start()
    f = safe_open(safetensors_path, framework="pt", device="cuda:0")
    snapshot_before = tracemalloc.take_snapshot()
    tensors = {name: f.get_tensor(name) for name in f.keys()}
    snapshot_after = tracemalloc.take_snapshot()

    stats = snapshot_after.compare_to(snapshot_before, "lineno")
    # CPU memory increase should be minimal (metadata, not tensor data)
    total_cpu_alloc = sum(s.size_diff for s in stats if s.size_diff > 0)
    total_gpu_bytes = sum(t.nbytes for t in tensors.values())
    print(f"GPU memory allocated: {total_gpu_bytes / 1e9:.2f} GB")
    print(f"CPU memory allocated: {total_cpu_alloc / 1e6:.2f} MB")
    # Expect: GPU ~140 GB, CPU ~50 MB (metadata only)
```
safetensors with device="cuda:0" achieves true zero-copy on Linux: the file is mmapped, and cudaMemcpy reads directly from the mmap region. On Windows, the behavior depends on the CUDA driver’s ability to pin mmap pages. In both cases, no intermediate CPU torch.Tensor objects are allocated for the weight data.
Progressive Loading: First Token While Still Loading
The standard approach loads all weights, then starts inference. For a 140 GB model taking 20 seconds to load, that is 20 seconds of zero useful work. Progressive loading starts inference on the first layers while later layers are still loading.
The key insight: transformer inference is sequential through layers. When executing layer 0, layers 1-79 are not needed yet. If we can pipeline loading and execution, we can produce the first token before the entire model is in GPU memory.
```python
import threading
import torch
from safetensors import safe_open

class ProgressiveModelLoader:
    """Load transformer layers progressively, enabling inference
    to start on early layers while later layers are still loading."""

    def __init__(self, model_config, safetensors_paths, device="cuda:0"):
        self.config = model_config
        self.paths = safetensors_paths
        self.device = device
        self.num_layers = model_config.num_hidden_layers
        # Layer readiness tracking
        self.layer_ready = [threading.Event() for _ in range(self.num_layers)]
        self.layer_weights = [None] * self.num_layers
        # Streams for overlapping load and compute
        self.load_stream = torch.cuda.Stream(device=device)
        self.compute_stream = torch.cuda.Stream(device=device)

    def _load_layer_weights(self, layer_idx):
        """Load weights for a single transformer layer."""
        prefix = f"model.layers.{layer_idx}."
        layer_state = {}
        for path in self.paths:
            f = safe_open(path, framework="pt", device="cpu")
            for name in f.keys():
                if name.startswith(prefix):
                    short_name = name[len(prefix):]
                    with torch.cuda.stream(self.load_stream):
                        layer_state[short_name] = f.get_tensor(name).to(
                            self.device, non_blocking=True
                        )
        # Synchronize the load stream to ensure all transfers complete
        self.load_stream.synchronize()
        return layer_state

    def start_loading(self):
        """Start background loading of all layers."""
        def _load_all():
            # Load embedding first (needed before any layer)
            self._load_embedding()
            # Load transformer layers sequentially
            for i in range(self.num_layers):
                self.layer_weights[i] = self._load_layer_weights(i)
                self.layer_ready[i].set()  # Signal that layer i is ready
            # Load output head last
            self._load_output_head()

        self.load_thread = threading.Thread(target=_load_all, daemon=True)
        self.load_thread.start()

    def forward_layer(self, layer_idx, hidden_states):
        """Execute a single layer, waiting for its weights if necessary."""
        # Block until this layer's weights are loaded
        self.layer_ready[layer_idx].wait()
        weights = self.layer_weights[layer_idx]
        with torch.cuda.stream(self.compute_stream):
            hidden_states = self._execute_transformer_layer(
                hidden_states, weights, layer_idx
            )
        return hidden_states

    def forward(self, input_ids):
        """Full forward pass with progressive loading."""
        hidden_states = self.embedding(input_ids)
        for i in range(self.num_layers):
            hidden_states = self.forward_layer(i, hidden_states)
        return self.output_head(hidden_states)
```
Pipelining Analysis
For a model with L layers, let t_load be the time to load one layer from disk to GPU, and t_exec be the time to execute one layer on an input.

Sequential loading: total time = L * (t_load + t_exec)

Progressive loading: total time ≈ L * t_load + t_exec (if t_exec <= t_load, execution is fully hidden behind loading)

More precisely, if loading is the bottleneck (t_load >= t_exec): total = L * t_load + t_exec

If execution is the bottleneck (t_exec > t_load): total = t_load + L * t_exec

For a 70B model on NVMe Gen4: t_load ≈ 250 ms (20 s spread over 80 layers), t_exec ≈ 15 ms (prefill, batch=1, seq=512). Loading dominates, so progressive loading saves only about (L - 1) * t_exec ≈ 1.2 s of the total cold start. The real win is responsiveness: the first layer finishes executing approximately 265 ms after loading starts (one layer load + one layer exec) instead of after the full 20 seconds.
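The two regimes can be checked numerically. A small sketch of the pipeline model, assuming t_load ≈ 250 ms and t_exec ≈ 15 ms per layer (consistent with a 20-second load over 80 layers and the 265 ms first-layer figure above):

```python
def cold_start_totals(num_layers, t_load, t_exec):
    """Sequential vs progressive totals under the simple pipeline model."""
    sequential = num_layers * (t_load + t_exec)
    if t_load >= t_exec:   # loading is the bottleneck
        progressive = num_layers * t_load + t_exec
    else:                  # execution is the bottleneck
        progressive = t_load + num_layers * t_exec
    return sequential, progressive

seq, prog = cold_start_totals(num_layers=80, t_load=0.250, t_exec=0.015)
print(f"sequential:  {seq:.2f} s")   # ~21.2 s
print(f"progressive: {prog:.2f} s")  # ~20.0 s
```

The gap between the two totals is small, which is exactly why the argument centers on when useful work can begin, not on total load time.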
Cold Start Time: Sequential vs Progressive Loading (70B FP16)
| Metric | Sequential (pickle) | Sequential (safetensors) | Progressive (safetensors) | Progressive + NVMe RAID | ModelExpress (snapshot) |
|---|---|---|---|---|---|
| Time to first token (s) | ~25-30 | ~20 | — | — | 0.06 |
NVMe Optimization for Model Loading
A single NVMe Gen4 drive delivers 7 GB/s sequential read. Loading 140 GB takes 20 seconds. To go faster, we need either faster drives or more drives.
NVMe RAID-0 Striping
Striping the model file across multiple NVMe drives gives linear bandwidth scaling:
```bash
# Create a RAID-0 across 4 NVMe drives
# This gives ~28 GB/s aggregate sequential read
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

# With 4x Gen4 NVMe in RAID-0:
# 140 GB / 28 GB/s = 5.0 seconds
```
Sharded Loading with Worker Pool
safetensors checkpoints are commonly sharded into multiple files, and each shard can be read from a different NVMe drive:
```python
import concurrent.futures
import threading
from pathlib import Path
from safetensors import safe_open

def sharded_parallel_load(shard_dir, device="cuda:0", max_workers=8):
    """Load safetensors shards in parallel from multiple NVMe drives.
    Assumes shards are distributed across mount points."""
    shard_paths = sorted(Path(shard_dir).glob("model-*.safetensors"))
    state_dict = {}
    lock = threading.Lock()

    def load_shard(path):
        """Load one shard file."""
        f = safe_open(str(path), framework="pt", device=device)
        return {name: f.get_tensor(name) for name in f.keys()}

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(load_shard, p): p for p in shard_paths}
        for future in concurrent.futures.as_completed(futures):
            shard_dict = future.result()
            with lock:
                state_dict.update(shard_dict)
    return state_dict
```
GDS (GPU Direct Storage)
NVIDIA GPUDirect Storage (GDS) eliminates the CPU from the data path entirely. Data flows directly from NVMe to GPU HBM through the PCIe fabric:
```
Standard path: NVMe -> CPU RAM (page cache) -> cudaMemcpy -> GPU HBM
GDS path:      NVMe -> PCIe switch -> GPU HBM (no CPU involvement)
```
```python
# Using the cuFile API for GPUDirect Storage
# Requires GDS-capable NVMe drives and the NVIDIA GDS driver stack
import json
import torch
import kvikio  # Python bindings for cuFile

def gds_load_tensor(file_path, offset, size, gpu_buffer):
    """Load tensor data directly from NVMe to GPU using GDS."""
    f = kvikio.CuFile(file_path, "r")
    # This bypasses the CPU entirely:
    # the NVMe controller DMAs data into GPU HBM via PCIe.
    # pread returns a future; .get() blocks until the I/O completes.
    bytes_read = f.pread(gpu_buffer, size, file_offset=offset).get()
    f.close()
    return bytes_read

def gds_load_safetensors(path, device="cuda:0"):
    """Load an entire safetensors file using GDS."""
    # First, parse the header on CPU (small, fast)
    with open(path, "rb") as f:
        header_size = int.from_bytes(f.read(8), "little")
        header = json.loads(f.read(header_size))
    data_offset = 8 + header_size

    dtype_map = {"F16": torch.float16, "BF16": torch.bfloat16,
                 "F32": torch.float32, "I32": torch.int32,
                 "U8": torch.uint8}
    tensors = {}
    cf = kvikio.CuFile(path, "r")
    for name, meta in header.items():
        if name == "__metadata__":
            continue
        start, end = meta["data_offsets"]
        # Allocate the GPU tensor, then DMA directly from NVMe into it
        gpu_tensor = torch.empty(meta["shape"], dtype=dtype_map[meta["dtype"]],
                                 device=device)
        cf.pread(gpu_tensor, end - start, file_offset=data_offset + start).get()
        tensors[name] = gpu_tensor
    cf.close()
    return tensors
```
Model Loading Bandwidth: CPU Path vs GDS (H100, NVMe Gen4x4)
| Method | Bandwidth (GB/s) | CPU Utilization | Bounce Buffer |
|---|---|---|---|
| read + cudaMemcpy | 6.2 | 100% (1 core) | Yes (page cache) |
| mmap + cudaMemcpy | 7.4 | ~30% (page faults) | Yes (page cache) |
| GDS (cuFile) | 12.8 | ~5% (setup only) | No |
| GDS + 4x NVMe RAID-0 | 24.1 | ~5% | No |
NVIDIA ModelExpress: Sub-200ms Cold Start
NVIDIA Dynamo’s ModelExpress takes a fundamentally different approach: instead of loading from a file, it restores a GPU memory snapshot. The model weights are pre-loaded into a persistent memory region (either GPU HBM via MPS, or a shared memory buffer) and new inference processes attach to the existing weights.
How ModelExpress Works
Traditional cold start:

```
Process start -> Parse config -> Load weights -> Build model -> Ready
[0ms]            [+50ms]         [+20000ms]      [+200ms]       [20250ms]
```

ModelExpress cold start:

```
Process start -> Attach to shared weights -> Build model skeleton -> Ready
[0ms]            [+10ms]                     [+50ms]               [60ms]
```
The key insight: model weights are read-only during inference. There is no reason to copy 140 GB of weights into each process’s private memory when they can share a single copy.
```python
import torch
from safetensors import safe_open

class ModelExpressLoader:
    """Simplified ModelExpress-style shared weight loading.
    Pre-loads weights into CUDA IPC-shared memory.
    Note: _share_cuda_ and _new_shared_cuda are private PyTorch
    internals (the machinery behind torch.multiprocessing's CUDA
    tensor sharing); a production system would use that reduction
    layer rather than calling them directly."""

    def __init__(self, model_path, device="cuda:0"):
        self.device = device
        self.model_path = model_path

    def create_snapshot(self):
        """One-time: load weights and create a shared snapshot."""
        f = safe_open(self.model_path, framework="pt", device=self.device)
        shared_tensors = {name: f.get_tensor(name) for name in f.keys()}

        # Save a CUDA IPC handle for each tensor's storage
        ipc_handles = {}
        for name, tensor in shared_tensors.items():
            ipc_handles[name] = {
                "handle": tensor.untyped_storage()._share_cuda_(),
                "shape": tuple(tensor.shape),
                "dtype": tensor.dtype,
            }
        return ipc_handles, shared_tensors

    def attach_to_snapshot(self, ipc_handles):
        """Fast: attach to pre-loaded weights via IPC handles.
        This takes milliseconds, not seconds."""
        state_dict = {}
        for name, info in ipc_handles.items():
            # Reconstruct the tensor from the IPC handle - no data copy
            storage = torch.UntypedStorage._new_shared_cuda(*info["handle"])
            tensor = torch.empty(0, dtype=info["dtype"], device=self.device)
            tensor.set_(storage, 0, info["shape"])
            state_dict[name] = tensor
        return state_dict
```
NIXL: Network-Aware Loading
NVIDIA’s NIXL (NVIDIA Inference Xfer Library) extends this concept across nodes. When a new node joins the inference cluster, instead of loading from NVMe, it pulls weights from an existing node’s GPU memory over RDMA:
```
NVMe loading:    140 GB / 7 GB/s   = 20.0 seconds
RDMA from peer:  140 GB / 50 GB/s  =  2.8 seconds (InfiniBand NDR, 400 Gbps)
Shared memory:   attach only       = 0.06 seconds (IPC handles only)
```
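The first two figures are size over bandwidth (400 Gbps is 50 GB/s); the shared-memory case copies nothing, so only handle attachment remains. Quick check:

```python
MODEL_GB = 140

nvme_s = MODEL_GB / 7.0        # NVMe Gen4: 7 GB/s sequential
rdma_s = MODEL_GB / (400 / 8)  # InfiniBand NDR: 400 Gbps = 50 GB/s
print(f"NVMe: {nvme_s:.1f} s, RDMA: {rdma_s:.1f} s")
# NVMe: 20.0 s, RDMA: 2.8 s
```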
```python
import torch
import torch.distributed as dist

class NIXLWeightTransfer:
    """Transfer model weights between nodes via RDMA.
    Uses GPUDirect RDMA: GPU HBM on node A -> NIC -> NIC -> GPU HBM on node B.
    Assumes torch.distributed is already initialized with a NCCL backend."""

    def __init__(self, local_rank, world_size):
        self.local_rank = local_rank
        self.world_size = world_size

    def pull_weights_from_peer(self, peer_rank, tensor_names):
        """Pull weights from a peer node's GPU via RDMA."""
        received_tensors = {}
        for name in tensor_names:
            # Allocate a local buffer matching the peer's tensor
            meta = self._get_tensor_meta(peer_rank, name)
            local_buf = torch.empty(
                meta["shape"], dtype=meta["dtype"],
                device=f"cuda:{self.local_rank}"
            )
            # NCCL recv uses GPUDirect RDMA when available
            dist.recv(local_buf, src=peer_rank)
            received_tensors[name] = local_buf
        return received_tensors

    def serve_weights_to_peer(self, dest_rank, state_dict):
        """Send local weights to a requesting peer."""
        for name, tensor in state_dict.items():
            dist.send(tensor, dst=dest_rank)
```
Cold Start Budget Breakdown
For production systems, cold start has a strict budget. Here is the breakdown for different deployment scenarios:
Cold Start Budget by Deployment Scenario
| Scenario | Budget | Strategy | Achievable |
|---|---|---|---|
| Batch inference (offline) | Minutes OK | Standard safetensors load | 20s |
| Serverless (scale-from-zero) | Less than 30s | Sharded + parallel NVMe + progressive | 8-12s |
| Autoscaling (add replica) | Less than 10s | GDS + RAID-0 + progressive | 5-7s |
| Failover (replace dead node) | Less than 5s | RDMA from peer node | 2.8s |
| Hot standby (MPS sharing) | Less than 200ms | ModelExpress (IPC attach) | 60ms |
Putting It All Together: Production Loading Pipeline
```python
class ProductionModelLoader:
    """Production model loader combining all optimizations."""

    def __init__(self, config):
        self.config = config
        self.device = config.device

    def load(self):
        """Load the model using the fastest available method."""
        # Priority 1: Shared memory (ModelExpress-style), ~60ms
        if self._shared_snapshot_available():
            return self._attach_shared(self._get_ipc_handles())
        # Priority 2: RDMA from a peer node, ~3s
        if self._peer_available():
            return self._rdma_pull()
        # Priority 3: GDS + parallel NVMe, ~6s
        if self._gds_available():
            return self._gds_parallel_load()
        # Priority 4: safetensors + progressive loading, ~19s
        return self._progressive_safetensors_load()

    def _progressive_safetensors_load(self):
        """Fallback: progressive safetensors loading."""
        loader = ProgressiveModelLoader(
            self.config, self._get_shard_paths(), self.device
        )
        loader.start_loading()
        return loader

    def _gds_parallel_load(self):
        """GDS loading from multiple NVMe drives."""
        state_dict = {}
        for path in self._get_shard_paths():
            state_dict.update(gds_load_safetensors(str(path), self.device))
        return state_dict
```
Loading Bandwidth Scaling: 1-8 NVMe Drives (140 GB Model)
| Metric | 1 drive | 2 drives | 4 drives | 8 drives |
|---|---|---|---|---|
| read + cudaMemcpy (GB/s) | 6.2 | — | — | — |
| GDS (GB/s) | 12.8 | — | 24.1 | — |
| PCIe 4.0 limit (GB/s) | 16 | 16 | 16 | 16 |
| PCIe 5.0 limit (GB/s) | 32 | 32 | 32 | 32 |
Checkpoint Format Comparison
The choice of checkpoint format has a direct impact on loading performance:
Checkpoint Format Performance (70B FP16, NVMe Gen4)
| Format | File Size | Deserialization | Supports mmap | Supports GDS | Safety |
|---|---|---|---|---|---|
| PyTorch (.bin) | 140 GB | 2.1s (pickle) | No | No | Unsafe (arbitrary code) |
| safetensors (.safetensors) | 140 GB | 2ms (JSON header) | Yes | Yes | Safe (no code exec) |
| GGUF (.gguf) | Varies (quantized) | 50ms (custom header) | Yes | No | Safe |
| TensorRT engine (.plan) | ~70 GB (optimized) | 500ms (deserialize) | No | No | Safe |
Memory-Mapped Weight Sharing Across Processes
When running multiple model replicas on the same machine (common with tensor parallelism or multiple independent workers), mmap enables physical memory sharing:
```python
import torch.multiprocessing as mp
from safetensors import safe_open

def worker(rank, model_path):
    """Each worker mmaps the same file.
    The OS shares the physical pages across all workers."""
    # All workers mmap the same file; the Linux kernel shares
    # the physical pages in the page cache. Rank doubles as GPU id.
    f = safe_open(model_path, framework="pt", device=f"cuda:{rank}")
    state_dict = {name: f.get_tensor(name) for name in f.keys()}
    # Host memory usage: each worker uses ~0 extra host RAM
    # because the mmap pages are shared via the page cache
    return state_dict

def launch_workers(model_path, num_workers=8):
    """Launch 8 workers that share mmap pages (one GPU per rank)."""
    mp.spawn(worker, args=(model_path,), nprocs=num_workers)
    # Total host RAM for model weights: 140 GB (shared),
    # NOT 140 GB x 8 = 1120 GB
```
mmap sharing only helps with host RAM. Each GPU still needs its own copy of the weights in HBM. For GPU-side sharing, use CUDA IPC (as in ModelExpress) or MPS (Multi-Process Service), which allows multiple processes to share a single GPU context and its memory allocations.
Benchmarking Your Loading Pipeline
```python
import time
import psutil
import torch

def benchmark_loading(loader_fn, model_path, device, warmup=1, trials=3):
    """Comprehensive loading benchmark."""
    results = []
    for trial in range(warmup + trials):
        # Clear CUDA caches and stats
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        # Drop the page cache for cold-cache numbers (requires root):
        # os.system("echo 3 > /proc/sys/vm/drop_caches")

        process = psutil.Process()
        mem_before = process.memory_info().rss
        t_start = time.perf_counter()
        state_dict = loader_fn(model_path, device)
        torch.cuda.synchronize()
        t_end = time.perf_counter()
        mem_after = process.memory_info().rss
        gpu_peak = torch.cuda.max_memory_allocated() / 1e9

        if trial >= warmup:
            total_bytes = sum(t.nbytes for t in state_dict.values())
            elapsed = t_end - t_start
            results.append({
                "trial": trial - warmup,
                "elapsed_s": elapsed,
                "bandwidth_gbps": (total_bytes / 1e9) / elapsed,
                "host_ram_delta_gb": (mem_after - mem_before) / 1e9,
                "gpu_peak_gb": gpu_peak,
                "num_tensors": len(state_dict),
            })
    return results
```
The loading pipeline is the foundation of cold start performance. Every optimization downstream (quantization, compilation, CUDA graph capture) is irrelevant if the model takes 30 seconds to get into GPU memory. The path from safetensors + mmap (20s) to ModelExpress-style snapshot attachment (60ms) represents a 300x improvement, and each step along that path is a concrete engineering choice with clear tradeoffs in complexity, hardware requirements, and deployment flexibility.