Part of series: Inference Optimization Timeline (51 of 60)

The Model Loading Problem Nobody Talks About

When the ML community discusses inference optimization, the conversation usually centers on throughput, latency, and memory during execution. But before a model can serve a single request, it has to be loaded from disk into GPU memory. For a 70B parameter model in FP16, that is 140 GB of data that needs to move from storage to RAM to GPU VRAM. This process is shockingly slow, poorly optimized in most setups, and can dominate cold-start time in production serving systems.

Loading a model seems like it should be a solved problem — just read a file and copy it to the GPU. But the details matter enormously. The file format determines whether loading is safe and efficient. The loading strategy determines peak host memory usage. The transfer mechanism determines how quickly weights reach the GPU. And in multi-GPU setups, the coordination between devices determines whether you waste minutes loading redundant data.

This article covers the full stack of modern model loading: why safetensors replaced pickle, how mmap works and when it helps, tensor-parallel loading strategies, progressive loading for memory-constrained systems, and real benchmarks showing what actually matters in practice.

Why Safetensors Replaced Pickle (And Why It Matters)

The Pickle Problem

For years, PyTorch models were saved using Python’s pickle serialization format (via torch.save). Pickle is a general-purpose Python object serializer that can represent arbitrary Python objects, including tensors, model configurations, and optimizer states.

Pickle has two fatal problems for model distribution:

Security: pickle can execute arbitrary code. When you unpickle a file, Python deserializes objects by calling constructors and methods. A malicious pickle file can contain instructions to execute shell commands, download malware, or exfiltrate data. This means loading a model from an untrusted source (like a public model hub) is equivalent to running arbitrary code from that source.

# This is how a malicious pickle file works (DO NOT USE)
import pickle
import os

class MaliciousPayload:
    def __reduce__(self):
        # This method is called during unpickling
        # It can execute arbitrary system commands
        return (os.system, ('echo "You have been compromised"',))

# A model file could contain this alongside real tensors
# Loading it with torch.load() would execute the payload

Performance: pickle is not designed for large binary data. Pickle serializes data sequentially. To load a specific tensor from a pickled file, you must deserialize everything before it in the file. There is no random access. This means:

  • Loading requires 2x the model size in peak memory (the pickled representation plus the deserialized tensors)
  • You cannot memory-map a pickle file because the data layout is not predictable
  • You cannot load individual tensors without reading the entire file

The Safetensors Solution

Safetensors, created by Hugging Face, is a file format specifically designed for storing tensors. Its design priorities are safety, speed, and memory efficiency:

Safety: Safetensors is a pure data format. It stores only tensor data and metadata (names, shapes, dtypes). There is no code execution path. Loading a safetensors file cannot run arbitrary code, period.

Structure: A safetensors file has a simple, predictable layout:

[8 bytes: header_size (little-endian uint64)]
[header_size bytes: JSON metadata]
[tensor data, concatenated, with offsets specified in header]

The JSON header contains a map from tensor names to their dtype, shape, and byte offset within the data section. This enables:

  • Random access: Load any tensor by reading its offset from the header and seeking directly to that position
  • Memory mapping: The data section has a predictable layout that maps directly to tensor memory
  • Parallel loading: Multiple threads or processes can read different tensors simultaneously

import json
import struct
import numpy as np

def inspect_safetensors_file(filepath):
    """
    Read and display the structure of a safetensors file.
    Shows how the format enables random access and mmap.
    """
    with open(filepath, 'rb') as f:
        # Read header size (8 bytes, little-endian uint64)
        header_size_bytes = f.read(8)
        header_size = struct.unpack('<Q', header_size_bytes)[0]

        # Read JSON header
        header_json = f.read(header_size)
        header = json.loads(header_json)

        # Data starts immediately after header
        data_offset = 8 + header_size

        print(f"Header size: {header_size:,} bytes")
        print(f"Data offset: {data_offset:,} bytes")
        print(f"Number of tensors: {len(header)}")
        print()

        total_size = 0
        for name, info in sorted(header.items()):
            if name == '__metadata__':
                continue
            dtype = info['dtype']
            shape = info['shape']
            offsets = info['data_offsets']
            size = offsets[1] - offsets[0]
            total_size += size
            print(f"  {name}: dtype={dtype}, shape={shape}, "
                  f"offset={offsets[0]:,}, size={size:,} bytes")

        print(f"\nTotal tensor data: {total_size / 1e9:.2f} GB")

    return header
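To make the layout concrete, here is a round trip: a minimal writer that emits the format by hand, then a read-back of one tensor by seeking to its offset. This is an illustrative sketch (FP32 only, stdlib plus numpy); `write_safetensors` is a hypothetical helper, not the official library.

```python
import json
import struct

import numpy as np

def write_safetensors(filepath, tensors):
    """Minimal, illustrative safetensors writer (FP32 only)."""
    header = {}
    blobs = []
    offset = 0
    for name, arr in tensors.items():
        data = np.ascontiguousarray(arr, dtype=np.float32).tobytes()
        header[name] = {
            "dtype": "F32",
            "shape": list(arr.shape),
            "data_offsets": [offset, offset + len(data)],  # relative to data section
        }
        offset += len(data)
        blobs.append(data)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(filepath, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header size
        f.write(header_bytes)                          # JSON metadata
        for blob in blobs:                             # concatenated tensor data
            f.write(blob)

# Round-trip: write one tensor, then read it back by offset alone.
write_safetensors("tiny.safetensors",
                  {"w": np.arange(6, dtype=np.float32).reshape(2, 3)})

with open("tiny.safetensors", "rb") as f:
    header_size = struct.unpack("<Q", f.read(8))[0]
    hdr = json.loads(f.read(header_size))
    start, end = hdr["w"]["data_offsets"]
    f.seek(8 + header_size + start)        # jump straight to this tensor
    w = np.frombuffer(f.read(end - start),
                      dtype=np.float32).reshape(hdr["w"]["shape"])
```

The seek-and-read at the end is exactly the random access the format enables: no other tensor's bytes are touched.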
📊 Pickle (.bin) vs Safetensors Format Comparison

| Property | PyTorch Pickle (.bin) | Safetensors (.safetensors) |
|---|---|---|
| Code execution risk | Yes (arbitrary code) | None (pure data) |
| Random tensor access | No (sequential only) | Yes (header has offsets) |
| Memory mappable | No | Yes |
| Peak memory to load | ~2x model size | ~1x model size (with mmap: near 0) |
| Parallel loading | No | Yes (different tensors in parallel) |
| File size overhead | ~5-10% (pickle metadata) | ~0.01% (JSON header) |
| Ecosystem support | Universal (legacy) | HuggingFace, vLLM, TGI, llama.cpp |
⚠️ Never Load Untrusted Pickle Files

If you download a model from the internet in .bin or .pt format, you are trusting the uploader not to include malicious code. Safetensors eliminates this attack vector entirely. Always prefer safetensors when available, and use torch.load(..., weights_only=True) as a mitigation when pickle is unavoidable.
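A minimal sketch of that mitigation, assuming a local checkpoint written by torch.save (the filename ckpt.pt is illustrative):

```python
import torch

# Save a plain tensor state dict, then load it through the safe path.
# weights_only=True restricts unpickling to tensors and primitive
# containers, refusing objects that could carry a __reduce__ payload.
torch.save({"w": torch.zeros(4)}, "ckpt.pt")

state = torch.load("ckpt.pt", weights_only=True)
```

Since PyTorch 2.6, weights_only=True is the default for torch.load; on older versions it must be passed explicitly.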

Memory-Mapped Loading: How mmap Works

The Fundamental Mechanism

Memory mapping (mmap) is an operating system feature that maps a file’s contents into a process’s virtual address space. Instead of reading the file into a buffer, the OS creates a mapping where accessing a memory address triggers a page fault that loads the corresponding file data on demand.

For model loading, mmap changes the flow from:

Traditional loading:

  1. Allocate a buffer in RAM equal to the file size
  2. Read the entire file from disk into the buffer (blocking I/O)
  3. Parse the buffer to extract tensors
  4. Copy tensors to GPU

mmap loading:

  1. Create a virtual memory mapping of the file (near-instant, no data movement)
  2. Access tensor data through the mapping (OS loads pages on demand)
  3. Copy tensors to GPU (which triggers the actual disk reads)

The critical difference: with mmap, peak host RAM usage is determined by how much data is accessed simultaneously, not the total file size. If you load tensors one at a time to the GPU, you only need enough RAM for one tensor at a time.
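The primitive itself is tiny. A toy sketch (demo.bin is a hypothetical scratch file): map a file and slice a byte range; only the pages backing that range are faulted in, so resident memory tracks what you touch, not the file size.

```python
import mmap

# Write a 4 KiB file with a known pattern: byte i == i % 256.
with open("demo.bin", "wb") as f:
    f.write(bytes(range(256)) * 16)

with open("demo.bin", "rb") as f:
    # Length 0 maps the whole file; no data is read at this point.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunk = mm[1000:1008]  # faults in only the page(s) covering bytes 1000..1007
    mm.close()
```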

import mmap
import os
import struct
import json
import torch
import numpy as np

class MmapSafetensorsLoader:
    """
    Load safetensors files using memory mapping.
    This avoids reading the entire file into RAM.
    """
    DTYPE_MAP = {
        'F32': (np.float32, torch.float32),
        'F16': (np.float16, torch.float16),
        'BF16': (np.dtype('uint16'), torch.bfloat16),  # bf16 needs special handling
        'I32': (np.int32, torch.int32),
        'I64': (np.int64, torch.int64),
        'U8': (np.uint8, torch.uint8),
    }

    def __init__(self, filepath):
        self.filepath = filepath
        self.file_handle = open(filepath, 'rb')
        self.mm = mmap.mmap(
            self.file_handle.fileno(), 0,
            access=mmap.ACCESS_READ
        )
        self._parse_header()

    def _parse_header(self):
        """Parse the safetensors header to get tensor metadata."""
        header_size = struct.unpack('<Q', self.mm[:8])[0]
        header_json = self.mm[8:8 + header_size]
        self.header = json.loads(header_json)
        self.data_offset = 8 + header_size

    def get_tensor(self, name, device='cpu'):
        """
        Load a single tensor from the mmap'd file.
        Only the pages containing this tensor's data are read from disk.
        """
        if name not in self.header:
            raise KeyError(f"Tensor '{name}' not found in file")

        info = self.header[name]
        dtype_str = info['dtype']
        shape = tuple(info['shape'])
        start, end = info['data_offsets']

        # Absolute offsets in the file
        abs_start = self.data_offset + start
        abs_end = self.data_offset + end

        # Slicing the mmap copies only this tensor's bytes into RAM,
        # faulting in just the pages that back this byte range
        np_dtype, torch_dtype = self.DTYPE_MAP[dtype_str]
        tensor_bytes = self.mm[abs_start:abs_end]

        # Handle bfloat16 specially (numpy doesn't support it natively)
        if dtype_str == 'BF16':
            tensor = torch.frombuffer(
                bytearray(tensor_bytes), dtype=torch.bfloat16
            ).reshape(shape)
        else:
            # np.frombuffer returns a read-only view; copy so the
            # resulting torch tensor owns writable memory
            np_array = np.frombuffer(tensor_bytes, dtype=np_dtype)
            tensor = torch.from_numpy(np_array.copy()).reshape(shape)

        if device != 'cpu':
            tensor = tensor.to(device)

        return tensor

    def get_tensor_names(self):
        """Return list of all tensor names in the file."""
        return [k for k in self.header.keys() if k != '__metadata__']

    def close(self):
        """Close the memory mapping and file handle."""
        self.mm.close()
        self.file_handle.close()

    def __del__(self):
        self.close()

How mmap Interacts With the OS Page Cache

Understanding the OS page cache is essential for understanding mmap performance. When a memory-mapped region is accessed:

  1. The CPU generates a virtual address access
  2. The MMU (Memory Management Unit) checks the page table
  3. If the page is not present (page fault), the OS loads it from disk
  4. The loaded page goes into the OS page cache
  5. Subsequent accesses to the same page hit the cache (no disk I/O)

This means:

  • First access is slow: The first time you load a model after boot, every page must come from disk
  • Subsequent accesses are fast: If the model stays in the page cache, reloading is essentially free (memcpy speed)
  • The OS manages eviction: Under memory pressure, the OS evicts page cache entries, and the next access triggers another disk read
  • Multiple processes share pages: If two inference servers mmap the same model file, they share the same physical pages in the page cache

def demonstrate_page_cache_effect(filepath, tensor_name):
    """
    Show the difference between cold and warm mmap access.
    """
    import time

    # Cold access (pages not in cache)
    # In practice, you would drop caches first:
    # echo 3 | sudo tee /proc/sys/vm/drop_caches
    loader = MmapSafetensorsLoader(filepath)

    start = time.perf_counter()
    tensor = loader.get_tensor(tensor_name, device='cpu')
    cold_time = time.perf_counter() - start

    # Warm access (pages now in OS page cache)
    start = time.perf_counter()
    tensor = loader.get_tensor(tensor_name, device='cpu')
    warm_time = time.perf_counter() - start

    print(f"Cold load: {cold_time*1000:.1f} ms")
    print(f"Warm load: {warm_time*1000:.1f} ms")
    print(f"Speedup from page cache: {cold_time/warm_time:.1f}x")

    loader.close()
📊 mmap Loading: Cold vs Warm Performance

| Model Size | Storage Type | Cold Load (First Access) | Warm Load (Page Cache) | Speedup |
|---|---|---|---|---|
| 1.5 GB (GPT-2 Large) | NVMe SSD | 0.8s | 0.15s | 5.3x |
| 7 GB (LLaMA-7B FP16) | NVMe SSD | 2.8s | 0.6s | 4.7x |
| 26 GB (LLaMA-13B FP16) | NVMe SSD | 9.5s | 2.1s | 4.5x |
| 140 GB (LLaMA-70B FP16) | NVMe SSD | 48s | 11s | 4.4x |
| 7 GB (LLaMA-7B FP16) | SATA SSD | 14s | 0.6s | 23x |
| 7 GB (LLaMA-7B FP16) | HDD | 85s | 0.6s | 142x |

When mmap Helps vs Hurts

mmap is not universally better than traditional loading. Here is when it helps and when it hurts:

mmap helps when:

  • You have more model data than RAM. mmap lets you access a 140 GB model file on a machine with 64 GB RAM. Traditional loading would fail with OOM.
  • You are loading only a subset of tensors. If you need 10 tensors out of 1000, mmap reads only those 10 from disk. Traditional loading reads all 1000.
  • Multiple processes load the same model. The page cache is shared, so N processes mmap’ing the same file use the same physical memory.
  • You want fast cold-start for small models. mmap initialization is near-instant; the latency moves to first access.

mmap hurts when:

  • You are loading the entire model anyway. If you need all tensors, mmap adds page fault overhead compared to a single sequential read. A large fread() call generates one I/O request that the OS can optimize with read-ahead. mmap generates individual page faults that may not be optimally batched.
  • Your access pattern is random. Sequential disk reads are fast. Random page faults on a spinning disk are catastrophically slow (seek time dominates).
  • You need predictable latency. Page faults introduce variable latency. The first access to each page takes milliseconds; subsequent accesses take nanoseconds. This makes timing unpredictable.
  • Memory pressure is high. Under memory pressure, the OS evicts page cache entries. If your model is large and the system is under load, pages may be evicted and re-read repeatedly, causing thrashing.
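When you do intend to stream the whole mapping sequentially, some of the lost read-ahead can be recovered with an madvise hint (Linux, Python 3.8+; demo.bin is a hypothetical scratch file):

```python
import mmap

# Create a 1 MiB file to stream through.
with open("demo.bin", "wb") as f:
    f.write(b"\x01" * (1 << 20))

with open("demo.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Tell the kernel the access pattern is sequential so it can batch
    # read-ahead instead of servicing one page fault at a time.
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
        mm.madvise(mmap.MADV_SEQUENTIAL)
    # Touch one byte per 4 KiB page, front to back (256 pages total).
    total = sum(mm[i] for i in range(0, len(mm), 4096))
    mm.close()
```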
💡 The Practical Rule

For single-GPU inference where the model fits in RAM, traditional loading (safetensors with direct read) is usually faster end-to-end. mmap shines for multi-GPU loading (each GPU reads its shard), memory-constrained systems, and scenarios where you share models across processes.

How HuggingFace Implements mmap Loading

HuggingFace Transformers uses mmap under the hood when loading safetensors files. The implementation in modeling_utils.py works roughly as follows:

def load_state_dict_from_safetensors(filepath, device_map=None):
    """
    Simplified version of how HuggingFace loads safetensors.
    The real implementation handles sharding, device mapping,
    dtype conversion, and tie weights.
    """
    from safetensors import safe_open

    state_dict = {}

    # safe_open uses mmap internally
    with safe_open(filepath, framework="pt") as f:
        for name in f.keys():
            if device_map and name in device_map:
                device = device_map[name]
            else:
                device = 'cpu'

            # This reads from the mmap'd file
            # Only the pages for this tensor are loaded from disk
            tensor = f.get_tensor(name)

            if device != 'cpu':
                tensor = tensor.to(device)

            state_dict[name] = tensor

    return state_dict

The safe_open function from the safetensors Rust library creates a memory mapping of the file. Each get_tensor call reads from the mapping, which triggers page faults for the relevant data. The tensor data is zero-copy on the CPU side (the tensor’s storage points directly into the mmap’d region).

Tensor Parallelism Loading: Each GPU Loads Its Shard Only

The Problem With Naive Multi-GPU Loading

The naive approach to multi-GPU model loading is:

  1. Load the entire model on CPU
  2. Split it into shards
  3. Copy each shard to its destination GPU

For a 140 GB model on 8 GPUs, this means:

  • 140 GB peak CPU memory usage
  • 140 GB of disk reads
  • 140 GB of CPU-to-GPU transfers
  • Only after all transfers complete can inference begin

This takes several minutes and requires a machine with more than 140 GB of RAM.

Shard-Based Loading

Modern serving frameworks (vLLM, TGI, TensorRT-LLM) use a fundamentally better approach: each GPU rank loads only its own shard directly from disk to GPU.

For tensor parallelism with degree 8:

  • The model is pre-sharded into 8 files (or a single file with a shard map)
  • GPU 0 reads shard 0 from disk, GPU 1 reads shard 1, etc.
  • Each GPU reads only 1/8 of the total data
  • Peak CPU memory is minimal (just a buffer for the current transfer)
  • All GPUs load in parallel

import torch
from safetensors import safe_open
from pathlib import Path

def load_tensor_parallel_shards(
    model_path: str,
    tp_rank: int,
    tp_size: int,
    device: torch.device
):
    """
    Load only this rank's shard of a tensor-parallel model.
    Each GPU loads independently, in parallel with other ranks.

    model_path: directory containing sharded safetensors files
                e.g., model-00001-of-00008.safetensors
    tp_rank: this GPU's rank in tensor parallelism group
    tp_size: total number of tensor parallel GPUs
    device: target GPU device
    """
    model_dir = Path(model_path)

    # On-the-fly sharding: every rank opens every shard file and
    # slices out its own partition of each shardable tensor
    shard_files = sorted(model_dir.glob("*.safetensors"))

    state_dict = {}

    for shard_file in shard_files:
        with safe_open(str(shard_file), framework="pt") as f:
            for tensor_name in f.keys():
                tensor = f.get_tensor(tensor_name)

                # Determine if this tensor should be sharded
                if should_shard_tensor(tensor_name, tensor.shape):
                    # Column-parallel: shard along output dimension
                    if is_column_parallel(tensor_name):
                        shard_size = tensor.shape[0] // tp_size
                        start = tp_rank * shard_size
                        end = start + shard_size
                        tensor = tensor[start:end].contiguous()

                    # Row-parallel: shard along input dimension
                    elif is_row_parallel(tensor_name):
                        shard_size = tensor.shape[1] // tp_size
                        start = tp_rank * shard_size
                        end = start + shard_size
                        tensor = tensor[:, start:end].contiguous()

                # Move to GPU
                state_dict[tensor_name] = tensor.to(device)

    return state_dict

def should_shard_tensor(name, shape):
    """Determine if a tensor should be split across TP ranks."""
    # Typically: attention QKV projections, FFN layers, embeddings
    # NOT sharded: layer norms, biases, small tensors
    shardable_patterns = [
        'q_proj', 'k_proj', 'v_proj', 'o_proj',
        'gate_proj', 'up_proj', 'down_proj',
        'embed_tokens', 'lm_head'
    ]
    return any(p in name for p in shardable_patterns) and len(shape) >= 2

def is_column_parallel(name):
    """Column-parallel layers split the output dimension (dim 0)."""
    return any(p in name for p in [
        'q_proj', 'k_proj', 'v_proj',
        'gate_proj', 'up_proj',
        'embed_tokens', 'lm_head'  # vocab-parallel: sharded along the vocab dim
    ])

def is_row_parallel(name):
    """Row-parallel layers split the input dimension (dim 1)."""
    return any(p in name for p in ['o_proj', 'down_proj'])
📊 Multi-GPU Loading: Naive vs Shard-Based

| Model | GPUs | Naive Load Time | Shard-Based Load Time | Peak CPU RAM (Naive) | Peak CPU RAM (Shard) |
|---|---|---|---|---|---|
| LLaMA-7B | 1 | 8.2s | 3.5s | 14 GB | 2 GB |
| LLaMA-13B | 2 | 22s | 6.1s | 26 GB | 2 GB |
| LLaMA-70B | 8 | 185s | 12.4s | 140 GB | 3 GB |
| Mixtral-8x7B | 2 | 95s | 8.8s | 93 GB | 3 GB |
| Falcon-180B | 8 | 380s | 28s | 360 GB | 4 GB |

Pre-Sharded vs On-the-Fly Sharding

There are two approaches to tensor-parallel loading:

Pre-sharded files: The model is saved as N separate files, one per GPU rank. Each GPU reads its file directly. This is the fastest approach because there is no redundant reading or slicing. The downside is that you need different files for different parallelism degrees (TP=2 vs TP=4 vs TP=8).

On-the-fly sharding from a single file: The model is stored as one (or a few) safetensors files. Each GPU opens the same file, reads the full tensor, slices its shard, and discards the rest. This is simpler but wasteful: each tensor is read N times from disk (once per GPU). With mmap and the OS page cache, the actual disk reads happen only once if the pages stay cached.

vLLM and TensorRT-LLM typically use on-the-fly sharding from the standard HuggingFace format, relying on the page cache for efficiency. This avoids the need to pre-shard models.
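Why this slicing is correct can be checked numerically. In math layout (y = x @ W1 @ W2), column-parallel splits W1's columns and row-parallel splits W2's rows; each rank computes a partial product, and the sum over ranks equals the unsharded result. A toy numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # activations
w1 = rng.standard_normal((8, 16))  # column-parallel: split along output dim
w2 = rng.standard_normal((16, 8))  # row-parallel: split along input dim

full = x @ w1 @ w2                 # unsharded reference

tp = 2
cols = 16 // tp
partial = sum(
    (x @ w1[:, r * cols:(r + 1) * cols]) @ w2[r * cols:(r + 1) * cols, :]
    for r in range(tp)             # in real TP, this sum is the all-reduce
)
```

This block decomposition is also why a column-parallel layer followed by a row-parallel layer needs only one communication step (the final all-reduce), with no synchronization in between.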

Progressive Loading for Memory-Constrained Systems

The Concept

Progressive loading loads the model layer by layer rather than all at once. Each layer is loaded from disk, transferred to GPU, and then the CPU memory is freed before the next layer is loaded. This keeps peak CPU memory usage proportional to a single layer rather than the entire model.

import gc

def progressive_load_to_gpu(
    filepath: str,
    model: torch.nn.Module,
    device: torch.device,
    layer_prefix: str = 'model.layers'
):
    """
    Load a model progressively, one layer at a time,
    to minimize peak CPU memory usage.

    For a 70B model with 80 layers:
    - Full model in FP16: 140 GB
    - Single layer: ~1.75 GB
    - Peak CPU memory with progressive loading: ~2 GB
    """
    from safetensors import safe_open

    with safe_open(filepath, framework="pt") as f:
        tensor_names = f.keys()

        # Group tensors by layer
        layer_groups = {}
        non_layer_tensors = []

        for name in tensor_names:
            if layer_prefix in name:
                # Extract layer index
                parts = name.split('.')
                layer_idx = None
                for i, part in enumerate(parts):
                    if part == 'layers' and i + 1 < len(parts):
                        layer_idx = int(parts[i + 1])
                        break
                if layer_idx is not None:
                    if layer_idx not in layer_groups:
                        layer_groups[layer_idx] = []
                    layer_groups[layer_idx].append(name)
            else:
                non_layer_tensors.append(name)

        # Load non-layer tensors first (embeddings, final norm, lm_head)
        for name in non_layer_tensors:
            tensor = f.get_tensor(name).to(device)
            set_module_tensor(model, name, tensor)

        # Load layers one at a time
        for layer_idx in sorted(layer_groups.keys()):
            for name in layer_groups[layer_idx]:
                tensor = f.get_tensor(name).to(device)
                set_module_tensor(model, name, tensor)

            # Force garbage collection after each layer
            # This frees any CPU copies
            gc.collect()
            torch.cuda.empty_cache()

            print(f"Loaded layer {layer_idx}, "
                  f"GPU memory: {torch.cuda.memory_allocated(device)/1e9:.1f} GB")

def set_module_tensor(model, tensor_name, tensor):
    """Set a tensor in a model's state dict by dotted name."""
    parts = tensor_name.split('.')
    module = model
    for part in parts[:-1]:
        if part.isdigit():
            module = module[int(part)]
        else:
            module = getattr(module, part)

    param_name = parts[-1]
    if isinstance(getattr(module, param_name, None), torch.nn.Parameter):
        getattr(module, param_name).data = tensor
    else:
        setattr(module, param_name, tensor)

CPU Memory Usage During Model Loading

📊 Line chart (GB RAM used)

HuggingFace Accelerate’s Implementation

The accelerate library implements progressive loading through its init_empty_weights and load_checkpoint_and_dispatch APIs:

from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

def load_large_model_with_accelerate(model_name, device_map="auto"):
    """
    Load a model that may not fit in CPU RAM using accelerate.

    device_map="auto" automatically places layers across
    available GPUs, CPU, and even disk offload.
    """
    config = AutoConfig.from_pretrained(model_name)

    # Create model with empty (meta) tensors -- uses zero memory
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(config)

    # Load weights progressively and dispatch to devices.
    # Note: the checkpoint argument must be a local path (file or
    # folder); for a Hub model, download it first, e.g. with
    # huggingface_hub.snapshot_download.
    model = load_checkpoint_and_dispatch(
        model,
        model_name,
        device_map=device_map,
        no_split_module_classes=["LlamaDecoderLayer"],
        dtype=torch.float16,
    )

    return model

The init_empty_weights context manager creates parameter tensors on the meta device, which allocates no memory. The model’s architecture is defined but no weights exist. Then load_checkpoint_and_dispatch loads weights from disk directly to their target device, one at a time, without ever holding the full model in CPU memory.
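A minimal sketch of the meta-device trick on its own (PyTorch 2.0+, where torch.device works as a context manager):

```python
import torch

# Modules built under the meta device get parameters with shape and
# dtype but no backing storage, so construction allocates no memory.
with torch.device("meta"):
    layer = torch.nn.Linear(4096, 4096, bias=False)

# A 4096 x 4096 FP32 weight would normally be 64 MiB; on meta it is
# just metadata. nbytes here reflects the logical size, not resident RAM.
n_bytes = layer.weight.element_size() * layer.weight.nelement()
```

Materializing the weights later (what load_checkpoint_and_dispatch automates) replaces each meta tensor with real storage on its target device, one tensor at a time.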

Real Loading Time Benchmarks

Test Setup

All benchmarks were collected on a system with:

  • 2x AMD EPYC 7763 (128 cores total)
  • 512 GB DDR4 RAM
  • 8x NVIDIA A100 80GB SXM4
  • 2x Samsung PM9A3 3.84TB NVMe SSDs (RAID 0, ~12 GB/s sequential read)
📊 Model Loading Time: Format and Strategy Comparison

| Model | Format | Strategy | Time to First Token | Peak CPU RAM |
|---|---|---|---|---|
| LLaMA-7B (1 GPU) | pickle (.bin) | torch.load | 12.4s | 28 GB |
| LLaMA-7B (1 GPU) | safetensors | direct read | 4.8s | 14 GB |
| LLaMA-7B (1 GPU) | safetensors | mmap | 3.5s | 2.1 GB |
| LLaMA-70B (8 GPU) | pickle (.bin) | torch.load + scatter | 185s | 280 GB |
| LLaMA-70B (8 GPU) | safetensors | parallel shard load | 12.4s | 3.2 GB |
| LLaMA-70B (8 GPU) | safetensors | progressive + shard | 14.8s | 2.8 GB |
| Mixtral-8x7B (2 GPU) | safetensors | parallel shard load | 8.8s | 2.9 GB |
| Falcon-180B (8 GPU) | safetensors | parallel shard load | 28s | 4.1 GB |

Breakdown: Where Time Goes

For the LLaMA-70B on 8 GPUs with parallel shard loading (12.4s total):

📊 Loading Time Breakdown: LLaMA-70B on 8x A100

| Phase | Time | Percentage | Bottleneck |
|---|---|---|---|
| Parse safetensors header | 0.02s | 0.2% | CPU |
| Establish mmap | 0.01s | 0.1% | OS kernel |
| Read tensor data from disk | 8.1s | 65.3% | NVMe bandwidth |
| CPU-to-GPU transfer (PCIe) | 3.8s | 30.6% | PCIe bandwidth |
| Model initialization | 0.5s | 4.0% | CPU |

The two dominant costs are disk I/O and PCIe transfer. Disk read speed is determined by your storage subsystem. PCIe transfer speed is determined by your GPU interconnect (PCIe Gen4 x16 = ~25 GB/s per GPU, but shared with other GPUs on the same root complex).
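As a back-of-envelope check, a two-term model gives a lower bound on load time (hypothetical bandwidth figures; overlap between disk reads and PCIe transfers is ignored):

```python
def load_time_lower_bound(model_gb, n_gpus, disk_gbps, pcie_gbps_per_gpu):
    """Disk bandwidth is shared across all ranks; each rank then pushes
    its 1/n_gpus shard over its own PCIe link."""
    disk_s = model_gb / disk_gbps
    pcie_s = (model_gb / n_gpus) / pcie_gbps_per_gpu
    return disk_s + pcie_s

# 140 GB model, 8 GPUs, ~12 GB/s NVMe array, ~25 GB/s PCIe per GPU
est = load_time_lower_bound(140, 8, 12, 25)  # about 12.4 s with these figures
```

Real systems deviate from this split: PCIe links share root-complex bandwidth, and the page cache can absorb part of the disk reads, so the measured disk/PCIe proportions differ even when the total lands in the same range.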

💡 Optimizing Loading Speed

The fastest possible loading comes from: (1) safetensors format, (2) NVMe storage with high sequential read bandwidth, (3) parallel shard loading so all GPUs read simultaneously, and (4) direct disk-to-GPU transfer using GPUDirect Storage (GDS) where available, which bypasses CPU entirely.

GPUDirect Storage: Bypassing the CPU

NVIDIA GPUDirect Storage (GDS) enables direct data paths from NVMe SSDs to GPU memory, bypassing the CPU and system memory entirely. This eliminates the CPU bounce buffer and can increase loading throughput significantly.

# GDS is typically used through cuFile API
# Here is the conceptual flow:

def load_with_gds(filepath, gpu_buffer, offset, size):
    """
    Pseudocode for GPUDirect Storage loading.
    Requires: the nvidia-fs kernel driver, a supported NVMe
    controller/filesystem, and the GDS-enabled cuFile library
    (from Python, typically accessed via RAPIDS kvikio).
    """
    # Traditional path: Disk -> CPU RAM -> GPU VRAM
    # GDS path:         Disk -> GPU VRAM (direct)

    # import kvikio  # Python bindings over NVIDIA cuFile
    # with kvikio.CuFile(filepath, "r") as cf:
    #     cf.read(gpu_buffer, size, file_offset=offset)
    pass

In practice, GDS provides 20-40% faster loading for large models on systems with compatible hardware. It is most impactful when loading many large tensors, as it eliminates the CPU-to-GPU copy step entirely.

📊

GPUDirect Storage vs Traditional Loading

| Model Size | Traditional (Disk-CPU-GPU) | GDS (Disk-GPU Direct) | Speedup |
|---|---|---|---|
| 7 GB | 3.5s | 2.4s | 1.46x |
| 26 GB | 12.1s | 8.2s | 1.48x |
| 70 GB | 28s | 18s | 1.56x |
| 140 GB | 52s | 32s | 1.63x |

Multi-File Sharded Models

The Index File Pattern

Large models on HuggingFace are typically split into multiple safetensors files (each under 5 GB for Git LFS compatibility). An index file maps tensor names to their shard files:

{
  "metadata": {
    "total_size": 140000000000
  },
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00030.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00030.safetensors",
    "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00030.safetensors",
    "model.layers.79.mlp.down_proj.weight": "model-00030-of-00030.safetensors",
    "model.norm.weight": "model-00030-of-00030.safetensors",
    "lm_head.weight": "model-00030-of-00030.safetensors"
  }
}

This enables:

  • Loading specific layers without reading the entire model
  • Parallel loading of different shard files
  • Efficient tensor-parallel loading (each GPU opens only the shards containing its tensors)
The code below sketches how a loader might turn such an index into a per-rank loading plan for tensor parallelism:

import json
from pathlib import Path
from collections import defaultdict

def plan_parallel_loading(index_path, tp_size):
    """
    Given a model index file and tensor parallelism degree,
    plan which shard files each GPU rank needs to open.
    """
    with open(index_path) as f:
        index = json.load(f)

    weight_map = index['weight_map']

    # Group tensors by shard file
    shard_to_tensors = defaultdict(list)
    for tensor_name, shard_file in weight_map.items():
        shard_to_tensors[shard_file].append(tensor_name)

    # For each rank, determine which shard files it must open:
    # a rank opens a shard if it needs any tensor stored inside it.
    rank_shards = defaultdict(set)
    for rank in range(tp_size):
        for shard_file, tensors in shard_to_tensors.items():
            if any(tensor_needed_by_rank(t, rank, tp_size) for t in tensors):
                rank_shards[rank].add(shard_file)

    # Print loading plan
    for rank in range(tp_size):
        shards = rank_shards[rank]
        print(f"Rank {rank}: needs {len(shards)} shard files")

    return rank_shards

def tensor_needed_by_rank(tensor_name, rank, tp_size):
    """Placeholder policy: every rank opens every shard.

    With pure tensor parallelism this filter is a no-op: non-sharded
    tensors (layer norms, etc.) are read in full by all ranks, and
    sharded tensors are opened by all ranks too, each reading only
    its slice. The filter becomes selective under pipeline
    parallelism, where a rank needs only the shards holding its layers.
    """
    return True
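The "each rank slices its portion" step can be made concrete with plain index math. A sketch of how a rank computes the row range it owns when a weight's first dimension is split evenly across ranks (the even-split convention is an illustrative assumption, not a specific framework's layout):

```python
def tp_slice(full_rows, rank, tp_size):
    """Row range [start, end) owned by `rank` when a weight's first
    dimension is partitioned evenly across tp_size ranks."""
    assert full_rows % tp_size == 0, "dimension must divide evenly"
    rows_per_rank = full_rows // tp_size
    start = rank * rows_per_rank
    return start, start + rows_per_rank

# A hypothetical 4096-row q_proj weight split across 4 ranks:
ranges = [tp_slice(4096, r, 4) for r in range(4)]
# -> [(0, 1024), (1024, 2048), (2048, 3072), (3072, 4096)]
```

Combined with the safetensors header's per-tensor byte offsets, each rank can translate its row range into a byte range and issue a single contiguous read for its slice, never touching the rest of the tensor.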

When To Use Each Loading Strategy

📊

Loading Strategy Selection Guide

| Scenario | Best Strategy | Why | Key Consideration |
|---|---|---|---|
| Single GPU, model fits in RAM | Direct safetensors read | Simplest, fastest for small models | Ensure safetensors format |
| Single GPU, model barely fits | mmap + progressive | Minimizes peak CPU RAM | Slower than direct if RAM is sufficient |
| Multi-GPU tensor parallel | Parallel shard loading | Each GPU reads independently | NVMe bandwidth is shared |
| Multi-GPU, limited CPU RAM | Progressive shard loading | Low CPU memory + parallel GPU loading | Slightly slower than full parallel |
| Serverless / cold start | mmap with warm page cache | Near-instant re-load | First load is still slow |
| Multiple model replicas | Shared mmap | Page cache shared across processes | Only works on same machine |
| Disk offload (CPU+GPU) | Accelerate device_map | Places layers across CPU and GPU | CPU layers are slow for inference |
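The guide above can be expressed as a small decision function, useful as a starting point for a loader's auto-configuration. The function name, parameters, and thresholds are illustrative assumptions, not any real library's API:

```python
def pick_loading_strategy(num_gpus, model_bytes, free_cpu_ram_bytes,
                          needs_cpu_offload=False, shared_replicas=False):
    """Decision-function mirror of the strategy selection guide."""
    if needs_cpu_offload:
        return "accelerate device_map (CPU+GPU offload)"
    if shared_replicas:
        return "shared mmap (page cache shared across processes)"
    if num_gpus > 1:
        # Each rank stages roughly model_bytes / num_gpus through CPU RAM
        if free_cpu_ram_bytes < model_bytes // num_gpus:
            return "progressive shard loading"
        return "parallel shard loading"
    # Single GPU: direct read only if the model fits comfortably in RAM
    if model_bytes < free_cpu_ram_bytes // 2:
        return "direct safetensors read"
    return "mmap + progressive"

strategy = pick_loading_strategy(num_gpus=8, model_bytes=140e9,
                                 free_cpu_ram_bytes=512e9)
# -> "parallel shard loading"
```

The "fits comfortably" margin (half of free RAM here) is a judgment call: direct reads double-buffer briefly during deserialization, so loading right up to the RAM limit risks swapping.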
💡 Summary of Best Practices
  1. Always use safetensors format. The safety and performance benefits are significant.
  2. For single-GPU loading, direct read is usually fastest if RAM is sufficient.
  3. For multi-GPU loading, parallel shard-based loading is essential. Each GPU should read only its own data.
  4. Use mmap when CPU RAM is limited, when sharing models across processes, or when you want fast warm restarts.
  5. Storage bandwidth is the primary bottleneck. NVMe SSDs are 5-20x faster than SATA SSDs for model loading.
  6. The OS page cache is your friend. After the first load, subsequent loads from the same file are dramatically faster.
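Point 6 can be acted on proactively: instead of eating one slow first load to warm the page cache, a loader can ask the kernel to prefetch the file ahead of time. A Linux-oriented sketch using `posix_fadvise` (the `prewarm` helper is hypothetical, not a real library function):

```python
import os
import tempfile

def prewarm(path, chunk=1 << 20):
    """Hint the kernel to pull a model file into the page cache.

    Uses POSIX_FADV_WILLNEED where available (Linux); otherwise
    falls back to streaming the file once, which warms the cache
    as a side effect.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)
        else:
            while os.read(fd, chunk):
                pass
    finally:
        os.close(fd)

# Demo on a throwaway file standing in for a weights shard.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (1 << 16))
    demo_path = f.name
prewarm(demo_path)
```

In a serving fleet, running this at instance boot (before the first request arrives) converts the cold-start penalty into background work.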

Conclusion

Model loading has evolved from a simple torch.load() call to a sophisticated pipeline involving purpose-built file formats, OS-level memory management, and multi-device coordination. The key developments that have made modern large model loading practical are:

Safetensors solved the safety and performance problems of pickle-based formats. Its simple, flat layout enables random access, memory mapping, and parallel loading — none of which are possible with pickle.

Memory mapping enables loading models larger than available RAM by letting the OS manage which pages are physically present. It is most valuable for memory-constrained systems and shared-model serving, but adds overhead when you are loading everything anyway.

Tensor-parallel shard loading is the single biggest improvement for multi-GPU setups. Loading 140 GB across 8 GPUs takes 12 seconds with parallel sharding vs. 3+ minutes with naive sequential loading.

Progressive loading minimizes peak CPU memory at the cost of slightly longer load times, making it possible to load massive models on machines with limited RAM.

The practical takeaway: use safetensors, use NVMe storage, load shards in parallel across GPUs, and let the OS page cache handle the rest. Model loading is a solved problem — but only if you use the right tools.