Part of the series "Inference Optimization Timeline" (16 of 60)

Your inference service crashed, and the autoscaler brought up a fresh GPU. The first request arrives 200 milliseconds later, but the model takes 20 seconds to load from NVMe. That request, and every other one queued behind it, times out. Cold start is not an initialization detail; it is user-facing latency that determines whether your service works at all under burst traffic or pod restarts. A 70B model occupies 140 GB in FP16. Every byte must travel from NVMe through host RAM into GPU HBM before the first token can be generated, and the naive approach takes 20-44 seconds depending on which bandwidth bottleneck dominates.

The Loading Pipeline

Model loading is a four-stage pipeline:

  1. Deserialize - Parse the checkpoint format, locate tensor offsets
  2. Read - Transfer bytes from storage to host RAM
  3. Reconstruct - Build PyTorch tensors from raw bytes
  4. Upload - Copy tensors from host RAM to GPU HBM via PCIe/NVLink

Each stage has a different bottleneck. The total cold start time is dominated by whichever stage is slowest, but pipelining can overlap them.

import time
import torch

def measure_loading_stages(model_path, device="cuda:0"):
    """Measure each stage of model loading independently."""
    timings = {}

    # Stage 1: Deserialize (parse metadata)
    t0 = time.perf_counter()
    # For safetensors, this reads only the header (first 8 bytes + header JSON)
    from safetensors import safe_open
    f = safe_open(model_path, framework="pt", device="cpu")
    tensor_names = f.keys()
    timings["deserialize_ms"] = (time.perf_counter() - t0) * 1000

    # Stage 2+3: Read + Reconstruct (get tensors on CPU)
    t0 = time.perf_counter()
    cpu_tensors = {}
    for name in tensor_names:
        cpu_tensors[name] = f.get_tensor(name)
    timings["read_reconstruct_ms"] = (time.perf_counter() - t0) * 1000

    # Stage 4: Upload to GPU
    t0 = time.perf_counter()
    gpu_tensors = {}
    for name, tensor in cpu_tensors.items():
        gpu_tensors[name] = tensor.to(device, non_blocking=True)
    torch.cuda.synchronize(device)
    timings["upload_ms"] = (time.perf_counter() - t0) * 1000

    total_bytes = sum(t.nbytes for t in cpu_tensors.values())
    timings["total_gb"] = total_bytes / (1024**3)
    timings["total_ms"] = sum(v for k, v in timings.items() if k.endswith("_ms"))
    return timings

Model Loading Time Breakdown (70B FP16, Single File)

Stage                        Time (s)  Bottleneck                   Bandwidth Utilization
Deserialize (pickle)         2.1       CPU single-thread            N/A
Deserialize (safetensors)    0.002     CPU (header only)            N/A
Read (read syscall)          22.4      NVMe seq read (6.2 GB/s)     88%
Read (mmap + fault)          18.7      Page faults + readahead      74%
Reconstruct (pickle)         8.3       CPU unpickle + malloc        N/A
Reconstruct (safetensors)    0.0       Zero-copy                    N/A
Upload (PCIe 4.0 x16)        8.75      PCIe 4.0: 16 GB/s per dir    100%
Upload (PCIe 5.0 x16)        4.38      PCIe 5.0: 32 GB/s per dir    100%

PyTorch Pickle vs safetensors: The Deserialization Gap

Traditional PyTorch checkpoints use Python’s pickle module. Loading a pickle file means:

  1. Read the entire file into memory
  2. Execute the pickle bytecode (this is a full Python interpreter loop)
  3. Allocate fresh tensors and copy data into them
  4. Rebuild the state_dict structure

This is serial, CPU-bound, and inherently unsafe (pickle can execute arbitrary Python). For a 140 GB checkpoint split across 30 shards, the pickle overhead alone can cost 5-10 seconds.

# Traditional PyTorch loading - what happens internally
import pickle

def _legacy_load_checkpoint(path):
    """Simplified view of torch.load internals (legacy format)."""
    with open(path, "rb") as f:
        # Read magic number and protocol version - each is a pickled object
        magic = pickle.load(f)
        # Main unpickle: executes pickle bytecode, allocates fresh
        # storage objects, and copies data out of the file - CPU bound
        result = pickle.load(f)
    return result

# safetensors loading - what happens internally
from safetensors import safe_open

def _safetensors_load(path):
    """safetensors reads an 8-byte header length, then the JSON header.
    Tensor data is at known offsets - no parsing needed."""
    f = safe_open(path, framework="pt", device="cpu")
    # Header parsing: read 8 bytes (u64 header_size)
    # Then read header_size bytes of JSON
    # JSON maps tensor_name -> {dtype, shape, data_offsets: [start, end]}
    # Getting a tensor = mmap the file + return a view at the offset
    # ZERO copy, ZERO pickle, ZERO arbitrary code execution
    return {name: f.get_tensor(name) for name in f.keys()}

The safetensors format stores a fixed-size header (8-byte length prefix + JSON metadata) followed by raw tensor data at byte-aligned offsets. To load tensor "model.layers.0.self_attn.q_proj.weight", the library reads the header once, looks up the offset range [start, end], and returns a tensor view pointing directly at the mmapped file region.

safetensors Header Structure

+------------------+-------------------------------------------+
| 8 bytes (u64 LE) | JSON header (header_size bytes)           |
| header_size      | {"tensor_name": {                         |
|                  |    "dtype": "F16",                        |
|                  |    "shape": [8192, 8192],                 |
|                  |    "data_offsets": [0, 134217728]         |
|                  |  }, ...}                                  |
+------------------+-------------------------------------------+
| Tensor data region (contiguous, byte-aligned)                |
| [tensor_0 bytes][padding][tensor_1 bytes][padding]...        |
+--------------------------------------------------------------+
Performance

safetensors deserialization is O(1) per tensor lookup once the header is parsed. The header for a 70B model is approximately 200 KB of JSON. Compare this to pickle, which must execute O(N) bytecode instructions, where N is the total number of storage objects (thousands for a large model).
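The header parse described above is small enough to sketch directly. This is a stdlib-only illustration of the format's framing (the helper name is ours, not part of the safetensors API):

```python
import json
import struct

def read_safetensors_header(path):
    """Parse only the safetensors header: an 8-byte little-endian u64
    length prefix, followed by that many bytes of JSON metadata."""
    with open(path, "rb") as f:
        (header_size,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_size))
    data_start = 8 + header_size  # raw tensor bytes begin here
    return header, data_start
```

Given a tensor name, `header[name]["data_offsets"]` then points straight into the data region, which is why lookup is O(1) after this single small read.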

mmap vs read: Two Approaches to Getting Data into RAM

The read syscall and mmap represent fundamentally different strategies for getting file data into host memory.

read Syscall

import os

def load_with_read(path, buffer_size=128 * 1024 * 1024):
    """Explicit read: data flows kernel page cache -> user buffer.
    Note: true O_DIRECT requires 512-byte-aligned buffers, sizes,
    and offsets, which os.pread does not guarantee - use io_uring
    or a C helper for that. This version goes through the page cache."""
    fd = os.open(path, os.O_RDONLY)
    file_size = os.fstat(fd).st_size
    buf = bytearray(file_size)
    offset = 0
    while offset < file_size:
        chunk = min(buffer_size, file_size - offset)
        data = os.pread(fd, chunk, offset)
        if not data:
            break  # unexpected EOF
        buf[offset:offset + len(data)] = data
        offset += len(data)
    os.close(fd)
    return buf

With read:

  • The kernel copies data from the page cache (or directly from disk with O_DIRECT) into user-space memory
  • You control the I/O pattern: sequential reads, specific buffer sizes, prefetching hints
  • Memory is allocated eagerly - read blocks until data is in the buffer
  • The data is in your address space and can be freed explicitly

mmap

import mmap
import os

def load_with_mmap(path):
    """mmap: OS maps file pages directly into virtual address space.
    No data is read until pages are accessed (demand paging)."""
    fd = os.open(path, os.O_RDONLY)
    file_size = os.fstat(fd).st_size

    # Map the entire file - this returns immediately (no I/O yet)
    mm = mmap.mmap(fd, file_size, access=mmap.ACCESS_READ)

    # Advise the kernel about our access pattern
    mm.madvise(mmap.MADV_SEQUENTIAL)  # We will read sequentially
    mm.madvise(mmap.MADV_WILLNEED)   # Prefetch aggressively

    os.close(fd)  # fd can be closed - mapping persists
    return mm

With mmap:

  • The kernel maps file pages directly into your virtual address space
  • No data is read until you touch a page (demand paging)
  • Subsequent accesses to the same page hit the page cache (no syscall)
  • The kernel manages eviction - pages can be dropped and re-read transparently
  • No explicit memory allocation for the file data
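Demand paging is easy to observe directly. A stdlib-only sketch (the file and sizes are illustrative): the `mmap.mmap` call returns without reading anything, and each first touch of a page triggers a fault that pulls in one 4 KB page.

```python
import mmap
import os
import tempfile

def demo_demand_paging():
    """mmap returns immediately; file data is read only when pages are touched."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\x00" * (16 * 1024 * 1024))  # 16 MB of zeros
        path = tmp.name
    fd = os.open(path, os.O_RDONLY)
    size = os.fstat(fd).st_size
    mm = mmap.mmap(fd, size, access=mmap.ACCESS_READ)  # no I/O happens here
    first, last = mm[0], mm[size - 1]  # each access faults in one page
    mm.close()
    os.close(fd)
    os.unlink(path)
    return first, last

# demo_demand_paging() -> (0, 0)
```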

When mmap Wins

mmap wins for model loading when:

  1. Multiple processes load the same model - the page cache is shared. Two vLLM workers loading Llama-70B share the same physical pages.
  2. You only need a subset of tensors - demand paging means untouched tensors never leave disk.
  3. Memory pressure is high - the kernel can evict model pages without your process being aware (they will be re-faulted if needed).

When read Wins

read wins when:

  1. You need all the data and want predictable latency - read gives you explicit control over I/O scheduling. mmap’s page faults cause unpredictable stalls.
  2. O_DIRECT is available - bypasses the page cache entirely, reducing memory pressure and avoiding double-buffering.
  3. NVMe parallelism - with io_uring or aio, you can issue multiple concurrent reads to saturate the NVMe queue depth.
# Linux io_uring for parallel NVMe reads
# This saturates the NVMe command queue (typical depth: 128-1024)
def parallel_nvme_read(path, offsets_and_sizes):
    """Issue parallel reads via io_uring to saturate NVMe bandwidth.
    Each (offset, size) pair becomes one SQE in the submission queue."""
    # In practice, use liburing bindings or a Python io_uring wrapper.
    # Simplified pseudocode:
    #
    # ring = io_uring_setup(queue_depth=256)
    # for offset, size in offsets_and_sizes:
    #     sqe = ring.get_sqe()
    #     io_uring_prep_read(sqe, fd, buf_at_offset, size, offset)
    # ring.submit()  # Submit all reads at once
    # ring.wait(count=len(offsets_and_sizes))  # Wait for all completions
    pass

mmap vs read vs io_uring: 140 GB Model Load (NVMe Gen4)

Method                        Time (s)  Peak RSS (GB)  NVMe Queue Depth  Bandwidth (GB/s)
read (sequential)             22.4      140            1                 6.2
read (O_DIRECT, 128 MB buf)   20.1      0.128          1                 6.9
mmap (MADV_SEQUENTIAL)        18.7      140 (shared)   Kernel-managed    7.4
io_uring (QD=128)             14.8      140            128               9.4
io_uring (QD=128, O_DIRECT)   13.2      0.128          128               10.6

safetensors Zero-Copy Loading

The key insight in safetensors is that tensor data on disk is already in the correct binary format. A float16 tensor stored in safetensors is just raw FP16 bytes in row-major order. No endian conversion, no decompression, no format transformation. This means the path from disk to GPU can be:

NVMe -> Page Cache -> mmap view -> cudaMemcpy -> GPU HBM

With zero CPU-side copies or transformations.

from safetensors import safe_open
import torch

def zero_copy_load_to_gpu(safetensors_path, device="cuda:0"):
    """Load tensors directly from safetensors to GPU.
    The 'device' parameter in safe_open controls where tensors land."""

    # When device="cuda:0", safetensors uses the following path:
    # 1. mmap the file
    # 2. For each tensor, get the mmap view (pointer + offset)
    # 3. Call cudaMemcpy(gpu_ptr, mmap_ptr + offset, size, HostToDevice)
    # No intermediate CPU tensor is ever allocated
    f = safe_open(safetensors_path, framework="pt", device=device)

    state_dict = {}
    for name in f.keys():
        state_dict[name] = f.get_tensor(name)
        # Tensor is now on GPU. The mmap page that backed it
        # can be evicted from the page cache immediately.

    return state_dict

def verify_zero_copy(safetensors_path):
    """Rough check that safetensors avoids CPU tensor allocation.
    Caveat: tracemalloc only sees allocations made through Python's
    allocators, so it catches Python-object overhead but not buffers
    allocated inside libtorch. For a hard check, compare process RSS
    (e.g. via psutil) before and after instead."""
    import tracemalloc
    tracemalloc.start()

    f = safe_open(safetensors_path, framework="pt", device="cuda:0")
    snapshot_before = tracemalloc.take_snapshot()

    tensors = {name: f.get_tensor(name) for name in f.keys()}

    snapshot_after = tracemalloc.take_snapshot()
    stats = snapshot_after.compare_to(snapshot_before, "lineno")

    # Python-level CPU allocation should be minimal (metadata, not tensor data)
    total_cpu_alloc = sum(s.size_diff for s in stats if s.size_diff > 0)
    total_gpu_bytes = sum(t.nbytes for t in tensors.values())

    print(f"GPU memory allocated: {total_gpu_bytes / 1e9:.2f} GB")
    print(f"CPU memory allocated: {total_cpu_alloc / 1e6:.2f} MB")
    # Expect: GPU ~140 GB, CPU-side Python allocations in the MB range
ℹ️ Note

safetensors with device="cuda:0" achieves true zero-copy on Linux: the file is mmapped, and cudaMemcpy reads directly from the mmap region. On Windows, the behavior depends on the CUDA driver’s ability to pin mmap pages. In both cases, no intermediate CPU torch.Tensor objects are allocated for the weight data.

Progressive Loading: First Token While Still Loading

The standard approach loads all weights, then starts inference. For a 140 GB model taking 20 seconds to load, that is 20 seconds of zero useful work. Progressive loading starts inference on the first layers while later layers are still loading.

The key insight: transformer inference is sequential through layers. When executing layer 0, layers 1-79 are not needed yet. If we can pipeline loading and execution, we can produce the first token before the entire model is in GPU memory.

import torch
import threading
import queue
from safetensors import safe_open

class ProgressiveModelLoader:
    """Load transformer layers progressively, enabling inference
    to start on early layers while later layers are still loading."""

    def __init__(self, model_config, safetensors_paths, device="cuda:0"):
        self.config = model_config
        self.paths = safetensors_paths
        self.device = device
        self.num_layers = model_config.num_hidden_layers

        # Layer readiness tracking
        self.layer_ready = [threading.Event() for _ in range(self.num_layers)]
        self.layer_weights = [None] * self.num_layers

        # Streams for overlapping load and compute
        self.load_stream = torch.cuda.Stream(device=device)
        self.compute_stream = torch.cuda.Stream(device=device)

    def _load_layer_weights(self, layer_idx):
        """Load weights for a single transformer layer."""
        prefix = f"model.layers.{layer_idx}."
        layer_state = {}

        for path in self.paths:
            # Reopening a shard per layer re-parses only its small header;
            # caching the handles avoids even that cost.
            f = safe_open(path, framework="pt", device="cpu")
            for name in f.keys():
                if name.startswith(prefix):
                    short_name = name[len(prefix):]
                    with torch.cuda.stream(self.load_stream):
                        # pin_memory() is what makes the host->device copy
                        # genuinely asynchronous on the load stream
                        tensor = f.get_tensor(name).pin_memory().to(
                            self.device, non_blocking=True
                        )
                    layer_state[short_name] = tensor

        # Synchronize the load stream to ensure all transfers complete
        self.load_stream.synchronize()
        return layer_state

    def start_loading(self):
        """Start background loading of all layers."""
        def _load_all():
            # Load embedding first (needed before any layer)
            self._load_embedding()

            # Load transformer layers sequentially
            for i in range(self.num_layers):
                self.layer_weights[i] = self._load_layer_weights(i)
                self.layer_ready[i].set()  # Signal that layer i is ready

            # Load output head last
            self._load_output_head()

        self.load_thread = threading.Thread(target=_load_all, daemon=True)
        self.load_thread.start()

    def forward_layer(self, layer_idx, hidden_states):
        """Execute a single layer, waiting for its weights if necessary."""
        # Block until this layer's weights are loaded
        self.layer_ready[layer_idx].wait()

        weights = self.layer_weights[layer_idx]
        with torch.cuda.stream(self.compute_stream):
            # Execute the layer using loaded weights
            hidden_states = self._execute_transformer_layer(
                hidden_states, weights, layer_idx
            )
        return hidden_states

    def forward(self, input_ids):
        """Full forward pass with progressive loading."""
        hidden_states = self.embedding(input_ids)

        for i in range(self.num_layers):
            hidden_states = self.forward_layer(i, hidden_states)

        logits = self.output_head(hidden_states)
        return logits

Pipelining Analysis

For a model with L layers, let t_load be the time to load one layer from disk to GPU, and t_exec be the time to execute one layer on an input.

Sequential loading: total time = L · t_load + L · t_exec

Progressive loading: total time = L · t_load + t_exec (if t_load ≥ t_exec, execution is fully hidden behind loading)

More precisely, if loading is the bottleneck (t_load > t_exec):

T_progressive = t_embed_load + L · t_load + t_exec

If execution is the bottleneck (t_exec > t_load):

T_progressive = L · t_load + L · t_exec - (L - 1) · t_load = t_load + L · t_exec

For a 70B model on NVMe Gen4: t_load ≈ 250 ms/layer and t_exec ≈ 15 ms/layer (prefill, batch=1, seq=512). Loading dominates, so progressive loading hides roughly 80 · 15 ms ≈ 1.2 s of compute behind the loads. Note what it does not do: the first token still requires a forward pass through every layer, so time-to-first-token only drops from about 21.2 s to about 20 s. The structural win is that execution begins roughly 265 ms after loading starts (one layer load plus one layer exec) and then trails the loader by one layer, instead of all compute waiting for the full 20-second load to finish.
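The two regimes above can be cross-checked with a tiny simulation of the pipeline (a hypothetical helper, assuming loads are strictly sequential on a single stream):

```python
def progressive_total_ms(num_layers, t_load_ms, t_exec_ms, t_embed_ms=0.0):
    """Finish time of the last layer when loading and execution are
    pipelined: layer i can execute only after it has finished loading
    AND layer i-1 has finished executing."""
    load_done = t_embed_ms
    exec_done = t_embed_ms
    for _ in range(num_layers):
        load_done += t_load_ms  # loads are sequential on the I/O path
        exec_done = max(exec_done, load_done) + t_exec_ms
    return exec_done

# Load-bound (70B-ish numbers): 80 * 250 + 15 = 20015 ms, vs 21200 ms sequential
# Exec-bound: reduces to t_load + L * t_exec, matching the formula above
```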

Cold Start Time: Sequential vs Progressive Loading (70B FP16)

Method                       Time to First Token (s)
Sequential (pickle)          32.5
Sequential (safetensors)     20.1
Progressive (safetensors)    18.9
Progressive + NVMe RAID      9.4
ModelExpress (snapshot)      0.18

NVMe Optimization for Model Loading

A single NVMe Gen4 drive delivers 7 GB/s sequential read. Loading 140 GB takes 20 seconds. To go faster, we need either faster drives or more drives.
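The drive-count math is worth writing down once, because striping only scales until the PCIe link to the drives saturates (illustrative helper; the 7 GB/s per-drive and 32 GB/s PCIe 4.0 figures are the nominal numbers used throughout this post):

```python
def nvme_load_time_s(model_gb, drives, per_drive_gbs=7.0, pcie_limit_gbs=32.0):
    """Sequential-read load time with RAID-0 striping: bandwidth scales
    linearly with drive count until the PCIe link saturates."""
    bandwidth = min(drives * per_drive_gbs, pcie_limit_gbs)
    return model_gb / bandwidth

# 1 drive:  140 / 7  = 20.0 s
# 4 drives: 140 / 28 = 5.0 s
# 8 drives: capped at 32 GB/s -> 4.375 s (adding drives stops helping)
```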

NVMe RAID-0 Striping

Striping the model file across multiple NVMe drives gives linear bandwidth scaling:

# Create a RAID-0 across 4 NVMe drives
# This gives ~28 GB/s aggregate sequential read
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

# With 4x Gen4 NVMe in RAID-0:
# 140 GB / 28 GB/s = 5.0 seconds

Sharded Loading with Worker Pool

Large checkpoints are conventionally split into multiple safetensors shard files, each with its own header. Shards can be read in parallel, ideally from different NVMe drives:

import concurrent.futures
from pathlib import Path

from safetensors import safe_open

def sharded_parallel_load(shard_dir, device="cuda:0", max_workers=8):
    """Load safetensors shards in parallel from multiple NVMe drives.
    Assumes shards are distributed across mount points."""

    shard_paths = sorted(Path(shard_dir).glob("model-*.safetensors"))
    state_dict = {}

    def load_shard(path):
        """Load one shard file."""
        f = safe_open(str(path), framework="pt", device=device)
        return {name: f.get_tensor(name) for name in f.keys()}

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(load_shard, p): p for p in shard_paths}
        for future in concurrent.futures.as_completed(futures):
            # as_completed yields in the main thread, so no lock is needed
            state_dict.update(future.result())

    return state_dict

GDS (GPU Direct Storage)

NVIDIA GPUDirect Storage (GDS) eliminates the CPU from the data path entirely. Data flows directly from NVMe to GPU HBM through the PCIe fabric:

Standard path:  NVMe -> CPU RAM (page cache) -> cudaMemcpy -> GPU HBM
GDS path:       NVMe -> PCIe switch -> GPU HBM (no CPU involvement)
# Using the cuFile API via kvikio for GPU Direct Storage
# This requires GDS-capable NVMe drives and the NVIDIA GDS driver stack
import kvikio  # Python bindings for cuFile

def gds_load_tensor(file_path, offset, size, gpu_buffer):
    """Load tensor data directly from NVMe to GPU using GDS."""
    f = kvikio.CuFile(file_path, "r")

    # This bypasses the CPU entirely: the NVMe controller DMAs data
    # into GPU HBM over PCIe. pread() is non-blocking and returns a
    # future; .get() waits for the DMA to complete.
    bytes_read = f.pread(gpu_buffer, size, file_offset=offset).get()

    f.close()
    return bytes_read

def gds_load_safetensors(path, device="cuda:0"):
    """Load an entire safetensors file using GDS."""
    import json
    import torch

    # First, parse the header on CPU (small, fast)
    with open(path, "rb") as f:
        header_size = int.from_bytes(f.read(8), "little")
        header = json.loads(f.read(header_size))

    data_offset = 8 + header_size

    tensors = {}
    cf = kvikio.CuFile(path, "r")

    for name, meta in header.items():
        if name == "__metadata__":
            continue
        start, end = meta["data_offsets"]
        size = end - start
        dtype = {"F16": torch.float16, "BF16": torch.bfloat16,
                 "F32": torch.float32, "I32": torch.int32,
                 "U8": torch.uint8}[meta["dtype"]]
        shape = meta["shape"]

        # Allocate the destination tensor in GPU HBM
        gpu_tensor = torch.empty(shape, dtype=dtype, device=device)
        # DMA directly from NVMe to GPU; .get() blocks until complete
        cf.pread(gpu_tensor, size, file_offset=data_offset + start).get()
        tensors[name] = gpu_tensor

    cf.close()
    return tensors

Model Loading Bandwidth: CPU Path vs GDS (H100, NVMe Gen4x4)

Method                Bandwidth (GB/s)  CPU Utilization     Bounce Buffer
read + cudaMemcpy     6.2               100% (1 core)       Yes (page cache)
mmap + cudaMemcpy     7.4               ~30% (page faults)  Yes (page cache)
GDS (cuFile)          12.8              ~5% (setup only)    No
GDS + 4x NVMe RAID-0  24.1              ~5%                 No

NVIDIA ModelExpress: Sub-200ms Cold Start

NVIDIA Dynamo’s ModelExpress takes a fundamentally different approach: instead of loading from a file, it restores a GPU memory snapshot. The model weights are pre-loaded into a persistent memory region (either GPU HBM via MPS, or a shared memory buffer) and new inference processes attach to the existing weights.

How ModelExpress Works

Traditional cold start:
  Process start -> Parse config -> Load weights -> Build model -> Ready
  [0ms]           [50ms]          [20000ms]       [200ms]       [20250ms]

ModelExpress cold start:
  Process start -> Attach to shared weights -> Build model skeleton -> Ready
  [0ms]           [10ms]                       [50ms]                 [60ms]

The key insight: model weights are read-only during inference. There is no reason to copy 140 GB of weights into each process’s private memory when they can share a single copy.

import torch
from safetensors import safe_open
from torch.multiprocessing.reductions import reduce_tensor

class ModelExpressLoader:
    """Simplified ModelExpress-style shared weight loading.
    Pre-loads weights once, then shares them across processes via
    CUDA IPC handles (a sketch using PyTorch's multiprocessing
    reduction machinery, which wraps cudaIpcGetMemHandle)."""

    def __init__(self, model_path, device="cuda:0"):
        self.device = device
        self.model_path = model_path

    def create_snapshot(self):
        """One-time: load weights and export CUDA IPC specs per tensor.
        The owner process must keep shared_tensors alive - the IPC
        handles point into its GPU allocations."""
        f = safe_open(self.model_path, framework="pt", device=self.device)
        shared_tensors = {}
        ipc_specs = {}

        for name in f.keys():
            tensor = f.get_tensor(name)
            shared_tensors[name] = tensor
            # reduce_tensor returns (rebuild_fn, args); for CUDA tensors
            # the args embed a cudaIpcMemHandle plus shape/dtype/offset
            ipc_specs[name] = reduce_tensor(tensor)

        return ipc_specs, shared_tensors

    def attach_to_snapshot(self, ipc_specs):
        """Fast path, run in a *different* process: rebuild tensors from
        the IPC handles. cudaIpcOpenMemHandle maps the owner's allocation
        into this process - milliseconds, and no data copy."""
        state_dict = {}
        for name, (rebuild_fn, args) in ipc_specs.items():
            state_dict[name] = rebuild_fn(*args)
        return state_dict

NIXL: Network-Aware Loading

NVIDIA’s NIXL (NVIDIA Inference Xfer Library) extends this concept across nodes. When a new node joins the inference cluster, instead of loading from NVMe, it pulls weights from an existing node’s GPU memory over RDMA:

NVMe loading:     140 GB / 7 GB/s   = 20.0 seconds
RDMA from peer:   140 GB / 400 Gbps = 2.8 seconds (InfiniBand NDR)
Shared memory:    140 GB / attach    = 0.06 seconds (IPC handles only)
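The only subtlety in these numbers is units: NVMe bandwidth is quoted in gigabytes per second, network links in gigabits. A quick sanity-check helper (hypothetical, not part of any NIXL API):

```python
def transfer_time_s(size_gb, bandwidth, unit="GBps"):
    """Time to move size_gb gigabytes over a link.
    unit='Gbps' divides by 8 to convert bit-rates (network links)
    to byte-rates; unit='GBps' is used as-is (storage devices)."""
    gb_per_s = bandwidth / 8.0 if unit == "Gbps" else float(bandwidth)
    return size_gb / gb_per_s

# NVMe Gen4:      transfer_time_s(140, 7)           -> 20.0 s
# InfiniBand NDR: transfer_time_s(140, 400, "Gbps") -> 2.8 s
```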
class NIXLWeightTransfer:
    """Transfer model weights between nodes via RDMA.
    Uses GPUDirect RDMA: GPU HBM on node A -> NIC -> NIC -> GPU HBM on node B."""

    def __init__(self, local_rank, world_size):
        self.local_rank = local_rank
        self.world_size = world_size

    def pull_weights_from_peer(self, peer_rank, tensor_names):
        """Pull weights from a peer node's GPU via RDMA."""
        import torch.distributed as dist

        received_tensors = {}
        for name in tensor_names:
            # Allocate local buffer
            meta = self._get_tensor_meta(peer_rank, name)
            local_buf = torch.empty(
                meta["shape"], dtype=meta["dtype"],
                device=f"cuda:{self.local_rank}"
            )
            # NCCL recv uses GPUDirect RDMA when available
            dist.recv(local_buf, src=peer_rank)
            received_tensors[name] = local_buf

        return received_tensors

    def serve_weights_to_peer(self, dest_rank, state_dict):
        """Send local weights to a requesting peer."""
        import torch.distributed as dist

        for name, tensor in state_dict.items():
            dist.send(tensor, dst=dest_rank)

Cold Start Budget Breakdown

For production systems, cold start has a strict budget. Here is the breakdown for different deployment scenarios:


Cold Start Budget by Deployment Scenario

Scenario                      Budget    Strategy                               Achievable
Batch inference (offline)     Minutes   Standard safetensors load              20 s
Serverless (scale-from-zero)  < 30 s    Sharded + parallel NVMe + progressive  8-12 s
Autoscaling (add replica)     < 10 s    GDS + RAID-0 + progressive             5-7 s
Failover (replace dead node)  < 5 s     RDMA from peer node                    2.8 s
Hot standby (MPS sharing)     < 200 ms  ModelExpress (IPC attach)              60 ms

Putting It All Together: Production Loading Pipeline

class ProductionModelLoader:
    """Production model loader combining all optimizations."""

    def __init__(self, config):
        self.config = config
        self.device = config.device

    def load(self):
        """Load model using the fastest available method."""

        # Priority 1: Shared memory (ModelExpress-style)
        if self._shared_snapshot_available():
            return self._attach_shared(
                self._get_ipc_handles()
            )  # ~60ms

        # Priority 2: RDMA from peer
        if self._peer_available():
            return self._rdma_pull()  # ~3s

        # Priority 3: GDS + parallel NVMe
        if self._gds_available():
            return self._gds_parallel_load()  # ~6s

        # Priority 4: safetensors + progressive loading
        return self._progressive_safetensors_load()  # ~19s

    def _progressive_safetensors_load(self):
        """Fallback: progressive safetensors loading."""
        loader = ProgressiveModelLoader(
            self.config, self._get_shard_paths(), self.device
        )
        loader.start_loading()
        return loader

    def _gds_parallel_load(self):
        """GDS loading from multiple NVMe drives."""
        # Shown sequentially for clarity; in practice each shard load
        # runs on its own thread so the drives stream concurrently.
        state_dict = {}
        for path in self._get_shard_paths():
            shard = gds_load_safetensors(str(path), self.device)
            state_dict.update(shard)
        return state_dict
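The priority cascade generalizes to a small helper that probes each path cheaply and falls back on failure rather than aborting. A sketch with illustrative names (not from any library):

```python
def load_with_fallback(loaders):
    """Try each (name, available_fn, load_fn) in priority order.

    available_fn is a cheap capability probe; load_fn does the real work.
    A failed probe or a raised exception falls through to the next entry.
    """
    errors = []
    for name, available_fn, load_fn in loaders:
        if not available_fn():
            continue
        try:
            return name, load_fn()
        except Exception as exc:  # a broken fast path degrades, not crashes
            errors.append((name, exc))
    raise RuntimeError(f"all loaders failed: {errors}")

# Illustrative chain mirroring the priorities above:
chain = [
    ("shared_memory", lambda: False, lambda: "ipc"),      # ~60ms path unavailable
    ("rdma_peer", lambda: True, lambda: "peer_weights"),  # ~3s path wins
    ("gds", lambda: True, lambda: "gds_weights"),
]
name, result = load_with_fallback(chain)
print(name)  # rdma_peer
```

Keeping the probes side-effect free matters: a capability check that itself takes seconds (or leaves a half-initialized NCCL group behind) erodes exactly the budget the fast paths are meant to protect.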

Loading Bandwidth Scaling: 1-8 NVMe Drives (140 GB Model)

Metric                   | 1 drive | 2 drives | 4 drives | 8 drives
read + cudaMemcpy (GB/s) | 6.2     | 11.8     | 21.3     | 32.1
GDS (GB/s)               | 12.8    | 24.1     | 41.2     | 52.8
PCIe 4.0 limit (GB/s)    | 32      | 32       | 32       | 32
PCIe 5.0 limit (GB/s)    | 64      | 64       | 64       | 64
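Reading the GDS row as load times for the 140 GB model makes both the scaling and the diminishing returns concrete. A quick sketch over the measured numbers (`load_time_s` is plain arithmetic, not a library call):

```python
def load_time_s(model_gb, bandwidth_gbps):
    """Wire-time lower bound: ignores setup, metadata, and GPU-side copies."""
    return model_gb / bandwidth_gbps

# Measured GDS bandwidth by drive count, from the table above (GB/s).
gds_gbps = {1: 12.8, 2: 24.1, 4: 41.2, 8: 52.8}
for drives, bw in gds_gbps.items():
    efficiency = bw / (gds_gbps[1] * drives)  # fraction of perfect linear scaling
    print(f"{drives} drive(s): {load_time_s(140, bw):5.1f}s ({efficiency:.0%} of linear)")
```

At 8 drives, scaling efficiency falls to roughly half of linear: the aggregate NVMe bandwidth has outrun what the rest of the PCIe path can absorb, so past that point faster links (PCIe 5.0) matter more than additional drives.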

Checkpoint Format Comparison

The choice of checkpoint format has a direct impact on loading performance:

Checkpoint Format Performance (70B FP16, NVMe Gen4)

Format                     | File Size          | Deserialization      | Supports mmap | Supports GDS | Safety
PyTorch (.bin)             | 140 GB             | 2.1s (pickle)        | No            | No           | Unsafe (arbitrary code)
safetensors (.safetensors) | 140 GB             | 2ms (JSON header)    | Yes           | Yes          | Safe (no code exec)
GGUF (.gguf)               | Varies (quantized) | 50ms (custom header) | Yes           | No           | Safe
TensorRT engine (.plan)    | ~70 GB (optimized) | 500ms (deserialize)  | No            | No           | Safe
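The 2ms deserialization figure follows directly from the format: a safetensors file is an 8-byte little-endian length prefix, a JSON header mapping tensor names to dtype/shape/byte-offsets, then raw tensor bytes. Parsing the header is a single JSON decode, with no pickle and no code execution. A minimal parser sketch, run against a toy in-memory file rather than a real checkpoint:

```python
import json
import struct

def parse_safetensors_header(blob):
    """Read the safetensors header: u64-LE length prefix + JSON."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8 : 8 + header_len])
    return header, 8 + header_len  # header dict, offset where tensor data begins

# Build a toy single-tensor file in memory (one 2x2 FP32 tensor = 16 bytes).
meta = {"w": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
header_bytes = json.dumps(meta).encode()
blob = struct.pack("<Q", len(header_bytes)) + header_bytes + b"\x00" * 16

header, data_start = parse_safetensors_header(blob)
print(header["w"]["shape"], data_start)
```

Because every tensor's byte range is known after that one decode, a loader can mmap the file and fetch tensors lazily or via GDS without touching the rest of the file, which is what makes the mmap and GDS columns above possible.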

Memory-Mapped Weight Sharing Across Processes

When running multiple model replicas on the same machine (common with tensor parallelism or multiple independent workers), mmap enables physical memory sharing:

import torch.multiprocessing as mp

def worker(rank, model_path):
    """Each worker mmaps the same file.
    The OS shares physical pages across all workers."""
    from safetensors import safe_open

    device_id = rank  # one GPU per worker

    # All workers mmap the same file; the Linux kernel shares
    # the physical pages through the page cache.
    f = safe_open(model_path, framework="pt", device=f"cuda:{device_id}")

    state_dict = {}
    for name in f.keys():
        state_dict[name] = f.get_tensor(name)

    # Memory usage: each worker uses ~0 extra host RAM
    # because mmap pages are shared via the page cache
    return state_dict

def launch_workers(model_path, num_workers=8):
    """Launch 8 workers that share mmap pages."""
    mp.spawn(
        worker,
        args=(model_path,),  # spawn supplies rank as the first argument
        nprocs=num_workers,
    )
    # Total host RAM for model weights: 140 GB (shared),
    # NOT 140 GB x 8 = 1120 GB
⚠️ Warning

mmap sharing only helps with host RAM. Each GPU still needs its own copy of the weights in HBM. For GPU-side sharing, use CUDA IPC (as in ModelExpress) or MPS (Multi-Process Service), which allows multiple processes to share a single GPU context and its memory allocations.
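The host-side claim is easy to verify in miniature: two shared mappings of the same file are backed by the same physical pages, so a write through one mapping is visible through the other without any copy. A self-contained sketch using a one-page temp file:

```python
import mmap
import os
import tempfile

# Create a one-page scratch file.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, mmap.PAGESIZE)

with open(path, "r+b") as f1, open(path, "r+b") as f2:
    # Two independent shared mappings of the same file;
    # the kernel backs both with the same page-cache pages.
    m1 = mmap.mmap(f1.fileno(), mmap.PAGESIZE)
    m2 = mmap.mmap(f2.fileno(), mmap.PAGESIZE)

    m1[:5] = b"hello"           # write through mapping 1
    seen_by_m2 = bytes(m2[:5])  # read through mapping 2: same physical pages
    m1.close()
    m2.close()

os.close(fd)
os.unlink(path)
print(seen_by_m2)  # b'hello'
```

This is exactly the mechanism the eight workers above rely on, just scaled up from one page to a 140 GB checkpoint.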

Benchmarking Your Loading Pipeline

import time
import json
import psutil
import torch

def benchmark_loading(loader_fn, model_path, device, warmup=1, trials=3):
    """Comprehensive loading benchmark."""
    results = []

    for trial in range(warmup + trials):
        # Clear caches
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()

        # Drop page cache (requires root)
        # os.system("echo 3 > /proc/sys/vm/drop_caches")

        process = psutil.Process()
        mem_before = process.memory_info().rss

        t_start = time.perf_counter()
        state_dict = loader_fn(model_path, device)
        torch.cuda.synchronize()
        t_end = time.perf_counter()

        mem_after = process.memory_info().rss
        gpu_peak = torch.cuda.max_memory_allocated() / 1e9

        if trial >= warmup:
            total_bytes = sum(t.nbytes for t in state_dict.values())
            elapsed = t_end - t_start
            results.append({
                "trial": trial - warmup,
                "elapsed_s": elapsed,
                "bandwidth_gbps": (total_bytes / 1e9) / elapsed,
                "host_ram_delta_gb": (mem_after - mem_before) / 1e9,
                "gpu_peak_gb": gpu_peak,
                "num_tensors": len(state_dict),
            })

    return results
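The per-trial dicts are noisy (page cache state, background I/O), so they are easiest to compare after reducing to a median and spread. A small summarizer over the result format above, shown with synthetic trial data (pure Python, no GPU required):

```python
import statistics

def summarize(results):
    """Median bandwidth and spread across benchmark trials."""
    bw = [r["bandwidth_gbps"] for r in results]
    return {
        "median_bandwidth_gbps": statistics.median(bw),
        "stdev_gbps": statistics.stdev(bw) if len(bw) > 1 else 0.0,
        "best_elapsed_s": min(r["elapsed_s"] for r in results),
    }

# Synthetic trials standing in for real benchmark output:
fake = [
    {"bandwidth_gbps": 12.1, "elapsed_s": 11.6},
    {"bandwidth_gbps": 12.8, "elapsed_s": 10.9},
    {"bandwidth_gbps": 11.9, "elapsed_s": 11.8},
]
print(summarize(fake)["median_bandwidth_gbps"])  # 12.1
```

Medians are preferable to means here because a single cold-cache outlier trial can otherwise dominate the comparison between two loaders.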

The loading pipeline is the foundation of cold start performance. Every optimization downstream (quantization, compilation, CUDA graph capture) is irrelevant if the model takes 30 seconds to get into GPU memory. The path from safetensors + mmap (20s) to ModelExpress-style snapshot attachment (60ms) represents a 300x improvement, and each step along that path is a concrete engineering choice with clear tradeoffs in complexity, hardware requirements, and deployment flexibility.