The Model Loading Problem Nobody Talks About
When the ML community discusses inference optimization, the conversation usually centers on throughput, latency, and memory during execution. But before a model can serve a single request, it has to be loaded from disk into GPU memory. For a 70B parameter model in FP16, that is 140 GB of data that needs to move from storage to RAM to GPU VRAM. This process is shockingly slow, poorly optimized in most setups, and can dominate cold-start time in production serving systems.
Loading a model seems like it should be a solved problem — just read a file and copy it to the GPU. But the details matter enormously. The file format determines whether loading is safe and efficient. The loading strategy determines peak host memory usage. The transfer mechanism determines how quickly weights reach the GPU. And in multi-GPU setups, the coordination between devices determines whether you waste minutes loading redundant data.
This article covers the full stack of modern model loading: why safetensors replaced pickle, how mmap works and when it helps, tensor-parallel loading strategies, progressive loading for memory-constrained systems, and real benchmarks showing what actually matters in practice.
Why Safetensors Replaced Pickle (And Why It Matters)
The Pickle Problem
For years, PyTorch models were saved using Python’s pickle serialization format (via torch.save). Pickle is a general-purpose Python object serializer that can represent arbitrary Python objects, including tensors, model configurations, and optimizer states.
Pickle has two fatal problems for model distribution:
Security: pickle can execute arbitrary code. When you unpickle a file, Python deserializes objects by calling constructors and methods. A malicious pickle file can contain instructions to execute shell commands, download malware, or exfiltrate data. This means loading a model from an untrusted source (like a public model hub) is equivalent to running arbitrary code from that source.
```python
# This is how a malicious pickle file works (DO NOT USE)
import pickle
import os

class MaliciousPayload:
    def __reduce__(self):
        # This method is called during unpickling.
        # It can execute arbitrary system commands.
        return (os.system, ('echo "You have been compromised"',))

# A model file could contain this alongside real tensors.
# Loading it with torch.load() would execute the payload.
```
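To see the mechanism concretely without running anything dangerous, here is a harmless variant: the payload's `__reduce__` makes unpickling call `str.upper` — any callable, including `os.system`, could stand in its place.

```python
import pickle

class Payload:
    def __reduce__(self):
        # On unpickling, Python calls str.upper('pwned') and uses
        # the result as the deserialized object. Swap in os.system
        # and you have arbitrary command execution.
        return (str.upper, ('pwned',))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # the callable runs during deserialization
print(result)  # 'PWNED'
```

Note that `pickle.loads` never needed the `Payload` class to exist on the loading side — the serialized callable is executed directly, which is exactly why untrusted pickle files are unsafe.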
Performance: pickle is not designed for large binary data. It serializes objects sequentially: to load a specific tensor from a pickled file, you must deserialize everything before it in the file. There is no random access. This means:
- Loading requires 2x the model size in peak memory (the pickled representation plus the deserialized tensors)
- You cannot memory-map a pickle file because the data layout is not predictable
- You cannot load individual tensors without reading the entire file
The Safetensors Solution
Safetensors, created by Hugging Face, is a file format specifically designed for storing tensors. Its design priorities are safety, speed, and memory efficiency:
Safety: Safetensors is a pure data format. It stores only tensor data and metadata (names, shapes, dtypes). There is no code execution path. Loading a safetensors file cannot run arbitrary code, period.
Structure: A safetensors file has a simple, predictable layout:
```
[8 bytes: header_size (little-endian uint64)]
[header_size bytes: JSON metadata]
[tensor data, concatenated, with offsets specified in header]
```
The JSON header contains a map from tensor names to their dtype, shape, and byte offset within the data section. This enables:
- Random access: Load any tensor by reading its offset from the header and seeking directly to that position
- Memory mapping: The data section has a predictable layout that maps directly to tensor memory
- Parallel loading: Multiple threads or processes can read different tensors simultaneously
```python
import json
import struct

def inspect_safetensors_file(filepath):
    """
    Read and display the structure of a safetensors file.
    Shows how the format enables random access and mmap.
    """
    with open(filepath, 'rb') as f:
        # Read header size (8 bytes, little-endian uint64)
        header_size_bytes = f.read(8)
        header_size = struct.unpack('<Q', header_size_bytes)[0]

        # Read JSON header
        header_json = f.read(header_size)
        header = json.loads(header_json)

    # Data starts immediately after the header
    data_offset = 8 + header_size
    tensor_names = [k for k in header if k != '__metadata__']

    print(f"Header size: {header_size:,} bytes")
    print(f"Data offset: {data_offset:,} bytes")
    print(f"Number of tensors: {len(tensor_names)}")
    print()

    total_size = 0
    for name in sorted(tensor_names):
        info = header[name]
        dtype = info['dtype']
        shape = info['shape']
        offsets = info['data_offsets']
        size = offsets[1] - offsets[0]
        total_size += size
        print(f"  {name}: dtype={dtype}, shape={shape}, "
              f"offset={offsets[0]:,}, size={size:,} bytes")

    print(f"\nTotal tensor data: {total_size / 1e9:.2f} GB")
    return header
```
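The layout is simple enough to build by hand. Below is a toy sketch that constructs a minimal two-tensor blob following the format above and then reads one tensor back by seeking straight to its offset — no other tensor is touched. (Real safetensors files may pad the header and include a `__metadata__` entry; this sketch omits both.)

```python
import json
import struct

def build_minimal_safetensors(tensors):
    """Build a minimal safetensors blob from {name: (dtype, shape, raw_bytes)}."""
    header = {}
    data = b''
    for name, (dtype, shape, raw) in tensors.items():
        start = len(data)
        data += raw
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [start, len(data)]}
    header_bytes = json.dumps(header).encode('utf-8')
    return struct.pack('<Q', len(header_bytes)) + header_bytes + data

# Two tiny FP32 tensors: [1.0, 2.0] and [3.0]
blob = build_minimal_safetensors({
    "a": ("F32", [2], struct.pack('<2f', 1.0, 2.0)),
    "b": ("F32", [1], struct.pack('<f', 3.0)),
})

# Random access: parse the header, then jump directly to tensor "b"
header_size = struct.unpack('<Q', blob[:8])[0]
header = json.loads(blob[8:8 + header_size])
data_offset = 8 + header_size
start, end = header["b"]["data_offsets"]
value = struct.unpack('<f', blob[data_offset + start:data_offset + end])[0]
print(value)  # 3.0
```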
Pickle (.bin) vs Safetensors Format Comparison
| Property | PyTorch Pickle (.bin) | Safetensors (.safetensors) |
|---|---|---|
| Code execution risk | Yes (arbitrary code) | None (pure data) |
| Random tensor access | No (sequential only) | Yes (header has offsets) |
| Memory mappable | No | Yes |
| Peak memory to load | ~2x model size | ~1x model size (with mmap: near 0) |
| Parallel loading | No | Yes (different tensors in parallel) |
| File size overhead | ~5-10% (pickle metadata) | ~0.01% (JSON header) |
| Ecosystem support | Universal (legacy) | HuggingFace, vLLM, TGI, llama.cpp |
If you download a model from the internet in .bin or .pt format, you are trusting the uploader not to include malicious code. Safetensors eliminates this attack vector entirely. Always prefer safetensors when available, and use torch.load(..., weights_only=True) as a mitigation when pickle is unavoidable.
Memory-Mapped Loading: How mmap Works
The Fundamental Mechanism
Memory mapping (mmap) is an operating system feature that maps a file’s contents into a process’s virtual address space. Instead of reading the file into a buffer, the OS creates a mapping where accessing a memory address triggers a page fault that loads the corresponding file data on demand.
For model loading, mmap changes the flow from:
Traditional loading:
- Allocate a buffer in RAM equal to the file size
- Read the entire file from disk into the buffer (blocking I/O)
- Parse the buffer to extract tensors
- Copy tensors to GPU
mmap loading:
- Create a virtual memory mapping of the file (near-instant, no data movement)
- Access tensor data through the mapping (OS loads pages on demand)
- Copy tensors to GPU (which triggers the actual disk reads)
The critical difference: with mmap, peak host RAM usage is determined by how much data is accessed simultaneously, not the total file size. If you load tensors one at a time to the GPU, you only need enough RAM for one tensor at a time.
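The on-demand behavior is easy to observe with the standard library alone. This toy sketch (nothing model-specific) maps a 4 MiB file and touches only its last four bytes; the mapping itself moves no data, and only the pages actually read are faulted in from disk.

```python
import mmap
import os
import tempfile

# Write a 4 MiB file ending in a known marker
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'\x00' * (4 * 1024 * 1024 - 4) + b'LAST')
    path = f.name

with open(path, 'rb') as f:
    # Creating the mapping is near-instant: no data is read yet
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Touching only the last 4 bytes faults in roughly one page,
    # not the whole 4 MiB file
    tail = mm[-4:]
    mm.close()

os.remove(path)
print(tail)  # b'LAST'
```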
```python
import mmap
import struct
import json
import torch
import numpy as np

class MmapSafetensorsLoader:
    """
    Load safetensors files using memory mapping.
    This avoids reading the entire file into RAM.
    """
    DTYPE_MAP = {
        'F32': (np.float32, torch.float32),
        'F16': (np.float16, torch.float16),
        'BF16': (np.dtype('uint16'), torch.bfloat16),  # bf16 needs special handling
        'I32': (np.int32, torch.int32),
        'I64': (np.int64, torch.int64),
        'U8': (np.uint8, torch.uint8),
    }

    def __init__(self, filepath):
        self.filepath = filepath
        self.file_handle = open(filepath, 'rb')
        self.mm = mmap.mmap(
            self.file_handle.fileno(), 0,
            access=mmap.ACCESS_READ
        )
        self._parse_header()

    def _parse_header(self):
        """Parse the safetensors header to get tensor metadata."""
        header_size = struct.unpack('<Q', self.mm[:8])[0]
        header_json = self.mm[8:8 + header_size]
        self.header = json.loads(header_json)
        self.data_offset = 8 + header_size

    def get_tensor(self, name, device='cpu'):
        """
        Load a single tensor from the mmap'd file.
        Only the pages containing this tensor's data are read from disk.
        """
        if name not in self.header:
            raise KeyError(f"Tensor '{name}' not found in file")
        info = self.header[name]
        dtype_str = info['dtype']
        shape = tuple(info['shape'])
        start, end = info['data_offsets']

        # Absolute offsets in the file
        abs_start = self.data_offset + start
        abs_end = self.data_offset + end

        # Slicing the mmap triggers page faults only for the needed pages.
        # (Note: the slice returns a bytes copy of this tensor's data.)
        np_dtype, torch_dtype = self.DTYPE_MAP[dtype_str]
        tensor_bytes = self.mm[abs_start:abs_end]

        # Handle bfloat16 specially (numpy doesn't support it natively)
        if dtype_str == 'BF16':
            tensor = torch.frombuffer(
                bytearray(tensor_bytes), dtype=torch.bfloat16
            ).reshape(shape)
        else:
            np_array = np.frombuffer(tensor_bytes, dtype=np_dtype)
            tensor = torch.from_numpy(np_array.copy()).reshape(shape)

        if device != 'cpu':
            tensor = tensor.to(device)
        return tensor

    def get_tensor_names(self):
        """Return list of all tensor names in the file."""
        return [k for k in self.header.keys() if k != '__metadata__']

    def close(self):
        """Close the memory mapping and file handle (idempotent)."""
        if not self.mm.closed:
            self.mm.close()
        self.file_handle.close()

    def __del__(self):
        # Guard against partially constructed instances
        try:
            self.close()
        except Exception:
            pass
```
How mmap Interacts With the OS Page Cache
Understanding the OS page cache is essential for understanding mmap performance. When a memory-mapped region is accessed:
- The CPU generates a virtual address access
- The MMU (Memory Management Unit) checks the page table
- If the page is not present (page fault), the OS loads it from disk
- The loaded page goes into the OS page cache
- Subsequent accesses to the same page hit the cache (no disk I/O)
This means:
- First access is slow: The first time you load a model after boot, every page must come from disk
- Subsequent accesses are fast: If the model stays in the page cache, reloading is essentially free (memcpy speed)
- The OS manages eviction: Under memory pressure, the OS evicts page cache entries, and the next access triggers another disk read
- Multiple processes share pages: If two inference servers mmap the same model file, they share the same physical pages in the page cache
```python
import time

def demonstrate_page_cache_effect(filepath, tensor_name):
    """
    Show the difference between cold and warm mmap access.
    """
    # Cold access (pages not yet in cache).
    # In practice, you would drop caches first:
    #   echo 3 | sudo tee /proc/sys/vm/drop_caches
    loader = MmapSafetensorsLoader(filepath)
    start = time.perf_counter()
    tensor = loader.get_tensor(tensor_name, device='cpu')
    cold_time = time.perf_counter() - start

    # Warm access (pages now in the OS page cache)
    start = time.perf_counter()
    tensor = loader.get_tensor(tensor_name, device='cpu')
    warm_time = time.perf_counter() - start

    print(f"Cold load: {cold_time*1000:.1f} ms")
    print(f"Warm load: {warm_time*1000:.1f} ms")
    print(f"Speedup from page cache: {cold_time/warm_time:.1f}x")
    loader.close()
```
mmap Loading: Cold vs Warm Performance
| Model Size | Storage Type | Cold Load (First Access) | Warm Load (Page Cache) | Speedup |
|---|---|---|---|---|
| 1.5 GB (GPT-2 Large) | NVMe SSD | 0.8s | 0.15s | 5.3x |
| 7 GB (LLaMA-7B FP16) | NVMe SSD | 2.8s | 0.6s | 4.7x |
| 26 GB (LLaMA-13B FP16) | NVMe SSD | 9.5s | 2.1s | 4.5x |
| 140 GB (LLaMA-70B FP16) | NVMe SSD | 48s | 11s | 4.4x |
| 7 GB (LLaMA-7B FP16) | SATA SSD | 14s | 0.6s | 23x |
| 7 GB (LLaMA-7B FP16) | HDD | 85s | 0.6s | 142x |
When mmap Helps vs Hurts
mmap is not universally better than traditional loading. Here is when it helps and when it hurts:
mmap helps when:
- You have more model data than RAM. mmap lets you access a 140 GB model file on a machine with 64 GB RAM. Traditional loading would fail with OOM.
- You are loading only a subset of tensors. If you need 10 tensors out of 1000, mmap reads only those 10 from disk. Traditional loading reads all 1000.
- Multiple processes load the same model. The page cache is shared, so N processes mmap’ing the same file use the same physical memory.
- You want fast cold-start for small models. mmap initialization is near-instant; the latency moves to first access.
mmap hurts when:
- You are loading the entire model anyway. If you need all tensors, mmap adds page fault overhead compared to a single sequential read. A large fread() call generates one I/O request that the OS can optimize with read-ahead; mmap generates individual page faults that may not be optimally batched.
- Your access pattern is random. Sequential disk reads are fast. Random page faults on a spinning disk are catastrophically slow (seek time dominates).
- You need predictable latency. Page faults introduce variable latency. The first access to each page takes milliseconds; subsequent accesses take nanoseconds. This makes timing unpredictable.
- Memory pressure is high. Under memory pressure, the OS evicts page cache entries. If your model is large and the system is under load, pages may be evicted and re-read repeatedly, causing thrashing.
For single-GPU inference where the model fits in RAM, traditional loading (safetensors with direct read) is usually faster end-to-end. mmap shines for multi-GPU loading (each GPU reads its shard), memory-constrained systems, and scenarios where you share models across processes.
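The guidance above condenses into a small decision heuristic. A sketch — the branch conditions mirror the prose, but the thresholds are illustrative assumptions, not measured cutoffs:

```python
def choose_loading_strategy(model_gb, ram_gb, n_gpus, shared_across_procs=False):
    """Heuristic sketch of the guidance above; thresholds are illustrative."""
    if n_gpus > 1:
        return "parallel shard load"      # each GPU reads its own slice
    if shared_across_procs or model_gb > ram_gb:
        return "mmap"                     # shared page cache / larger than RAM
    return "direct safetensors read"      # simplest and usually fastest

print(choose_loading_strategy(model_gb=140, ram_gb=64, n_gpus=1))    # mmap
print(choose_loading_strategy(model_gb=14, ram_gb=256, n_gpus=1))    # direct safetensors read
print(choose_loading_strategy(model_gb=140, ram_gb=512, n_gpus=8))   # parallel shard load
```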
How HuggingFace Implements mmap Loading
HuggingFace Transformers uses mmap under the hood when loading safetensors files. The implementation in modeling_utils.py works roughly as follows:
```python
def load_state_dict_from_safetensors(filepath, device_map=None):
    """
    Simplified version of how HuggingFace loads safetensors.
    The real implementation handles sharding, device mapping,
    dtype conversion, and tied weights.
    """
    from safetensors import safe_open

    state_dict = {}
    # safe_open uses mmap internally
    with safe_open(filepath, framework="pt") as f:
        for name in f.keys():
            if device_map and name in device_map:
                device = device_map[name]
            else:
                device = 'cpu'
            # This reads from the mmap'd file.
            # Only the pages for this tensor are loaded from disk.
            tensor = f.get_tensor(name)
            if device != 'cpu':
                tensor = tensor.to(device)
            state_dict[name] = tensor
    return state_dict
```
The safe_open function from the safetensors Rust library creates a memory mapping of the file. Each get_tensor call reads from the mapping, which triggers page faults for the relevant data. The tensor data is zero-copy on the CPU side (the tensor’s storage points directly into the mmap’d region).
Tensor Parallelism Loading: Each GPU Loads Its Shard Only
The Problem With Naive Multi-GPU Loading
The naive approach to multi-GPU model loading is:
- Load the entire model on CPU
- Split it into shards
- Copy each shard to its destination GPU
For a 140 GB model on 8 GPUs, this means:
- 140 GB peak CPU memory usage
- 140 GB of disk reads
- 140 GB of CPU-to-GPU transfers
- Only after all transfers complete can inference begin
This takes several minutes and requires a machine with more than 140 GB of RAM.
Shard-Based Loading
Modern serving frameworks (vLLM, TGI, TensorRT-LLM) use a fundamentally better approach: each GPU rank loads only its own shard directly from disk to GPU.
For tensor parallelism with degree 8:
- The model is pre-sharded into 8 files (or a single file with a shard map)
- GPU 0 reads shard 0 from disk, GPU 1 reads shard 1, etc.
- Each GPU reads only 1/8 of the total data
- Peak CPU memory is minimal (just a buffer for the current transfer)
- All GPUs load in parallel
```python
import torch
from safetensors import safe_open
from pathlib import Path

def load_tensor_parallel_shards(
    model_path: str,
    tp_rank: int,
    tp_size: int,
    device: torch.device,
):
    """
    Load only this rank's shard of a tensor-parallel model.
    Each GPU loads independently, in parallel with other ranks.

    model_path: directory containing sharded safetensors files,
                e.g., model-00001-of-00008.safetensors
    tp_rank:    this GPU's rank in the tensor parallelism group
    tp_size:    total number of tensor parallel GPUs
    device:     target GPU device
    """
    model_dir = Path(model_path)
    # Find the shard files (by convention, numbered 1-indexed)
    shard_files = sorted(model_dir.glob("*.safetensors"))

    state_dict = {}
    for shard_file in shard_files:
        with safe_open(str(shard_file), framework="pt") as f:
            for tensor_name in f.keys():
                tensor = f.get_tensor(tensor_name)

                # Determine if this tensor should be sharded
                if should_shard_tensor(tensor_name, tensor.shape):
                    # Column-parallel: shard along the output dimension
                    if is_column_parallel(tensor_name):
                        shard_size = tensor.shape[0] // tp_size
                        start = tp_rank * shard_size
                        end = start + shard_size
                        tensor = tensor[start:end].contiguous()
                    # Row-parallel: shard along the input dimension
                    elif is_row_parallel(tensor_name):
                        shard_size = tensor.shape[1] // tp_size
                        start = tp_rank * shard_size
                        end = start + shard_size
                        tensor = tensor[:, start:end].contiguous()

                # Move to GPU
                state_dict[tensor_name] = tensor.to(device)
    return state_dict

def should_shard_tensor(name, shape):
    """Determine if a tensor should be split across TP ranks."""
    # Typically sharded: attention QKV projections, FFN layers, embeddings.
    # NOT sharded: layer norms, biases, small tensors.
    shardable_patterns = [
        'q_proj', 'k_proj', 'v_proj', 'o_proj',
        'gate_proj', 'up_proj', 'down_proj',
        'embed_tokens', 'lm_head',
    ]
    return any(p in name for p in shardable_patterns) and len(shape) >= 2

def is_column_parallel(name):
    """Column-parallel layers split the output dimension."""
    return any(p in name for p in [
        'q_proj', 'k_proj', 'v_proj',
        'gate_proj', 'up_proj',
        'embed_tokens',
    ])

def is_row_parallel(name):
    """Row-parallel layers split the input dimension."""
    return any(p in name for p in ['o_proj', 'down_proj', 'lm_head'])
```
Multi-GPU Loading: Naive vs Shard-Based
| Model | GPUs | Naive Load Time | Shard-Based Load Time | Peak CPU RAM (Naive) | Peak CPU RAM (Shard) |
|---|---|---|---|---|---|
| LLaMA-7B | 1 | 8.2s | 3.5s | 14 GB | 2 GB |
| LLaMA-13B | 2 | 22s | 6.1s | 26 GB | 2 GB |
| LLaMA-70B | 8 | 185s | 12.4s | 140 GB | 3 GB |
| Mixtral-8x7B | 2 | 95s | 8.8s | 93 GB | 3 GB |
| Falcon-180B | 8 | 380s | 28s | 360 GB | 4 GB |
Pre-Sharded vs On-the-Fly Sharding
There are two approaches to tensor-parallel loading:
Pre-sharded files: The model is saved as N separate files, one per GPU rank. Each GPU reads its file directly. This is the fastest approach because there is no redundant reading or slicing. The downside is that you need different files for different parallelism degrees (TP=2 vs TP=4 vs TP=8).
On-the-fly sharding from a single file: The model is stored as one (or a few) safetensors files. Each GPU opens the same file, reads the full tensor, slices its shard, and discards the rest. This is simpler but wasteful: each tensor is read N times from disk (once per GPU). With mmap and the OS page cache, the actual disk reads happen only once if the pages stay cached.
vLLM and TensorRT-LLM typically use on-the-fly sharding from the standard HuggingFace format, relying on the page cache for efficiency. This avoids the need to pre-shard models.
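The slicing in on-the-fly sharding reduces to computing each rank's interval along the sharded dimension. A minimal sketch (assuming, as most implementations do, that the dimension divides evenly by the TP degree):

```python
def shard_slice(shape, tp_rank, tp_size, dim):
    """Return the (start, end) interval this rank takes along `dim`."""
    assert shape[dim] % tp_size == 0, "dimension must divide evenly by tp_size"
    size = shape[dim] // tp_size
    return tp_rank * size, (tp_rank + 1) * size

# Column-parallel weight [out=4096, in=1024] on 4 ranks: split dim 0
print(shard_slice((4096, 1024), tp_rank=1, tp_size=4, dim=0))  # (1024, 2048)
# Row-parallel weight [4096, 1024] on 4 ranks: split dim 1
print(shard_slice((4096, 1024), tp_rank=3, tp_size=4, dim=1))  # (768, 1024)
```

Because every rank computes its slice from the same full tensor, the same file serves any TP degree that divides the weight dimensions — the flexibility that pre-sharded files give up.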
Progressive Loading for Memory-Constrained Systems
The Concept
Progressive loading loads the model layer by layer rather than all at once. Each layer is loaded from disk, transferred to GPU, and then the CPU memory is freed before the next layer is loaded. This keeps peak CPU memory usage proportional to a single layer rather than the entire model.
```python
import gc
import torch
from safetensors import safe_open

def progressive_load_to_gpu(
    filepath: str,
    model: torch.nn.Module,
    device: torch.device,
    layer_prefix: str = 'model.layers',
):
    """
    Load a model progressively, one layer at a time,
    to minimize peak CPU memory usage.

    For a 70B model with 80 layers:
    - Full model in FP16: 140 GB
    - Single layer: ~1.75 GB
    - Peak CPU memory with progressive loading: ~2 GB
    """
    with safe_open(filepath, framework="pt") as f:
        tensor_names = f.keys()

        # Group tensors by layer
        layer_groups = {}
        non_layer_tensors = []
        for name in tensor_names:
            if layer_prefix in name:
                # Extract the layer index
                parts = name.split('.')
                layer_idx = None
                for i, part in enumerate(parts):
                    if part == 'layers' and i + 1 < len(parts):
                        layer_idx = int(parts[i + 1])
                        break
                if layer_idx is not None:
                    layer_groups.setdefault(layer_idx, []).append(name)
            else:
                non_layer_tensors.append(name)

        # Load non-layer tensors first (embeddings, final norm, lm_head)
        for name in non_layer_tensors:
            tensor = f.get_tensor(name).to(device)
            set_module_tensor(model, name, tensor)

        # Load layers one at a time
        for layer_idx in sorted(layer_groups.keys()):
            for name in layer_groups[layer_idx]:
                tensor = f.get_tensor(name).to(device)
                set_module_tensor(model, name, tensor)

            # Force garbage collection after each layer
            # to free any CPU copies
            gc.collect()
            torch.cuda.empty_cache()

            print(f"Loaded layer {layer_idx}, "
                  f"GPU memory: {torch.cuda.memory_allocated(device)/1e9:.1f} GB")

def set_module_tensor(model, tensor_name, tensor):
    """Set a tensor in a model by its dotted state-dict name."""
    parts = tensor_name.split('.')
    module = model
    for part in parts[:-1]:
        if part.isdigit():
            module = module[int(part)]
        else:
            module = getattr(module, part)
    param_name = parts[-1]
    if isinstance(getattr(module, param_name, None), torch.nn.Parameter):
        getattr(module, param_name).data = tensor
    else:
        setattr(module, param_name, tensor)
```
[Figure: CPU memory usage (GB RAM) over time during model loading]
HuggingFace Accelerate’s Implementation
The accelerate library implements progressive loading through its init_empty_weights and load_checkpoint_and_dispatch APIs:
```python
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

def load_large_model_with_accelerate(model_name, device_map="auto"):
    """
    Load a model that may not fit in CPU RAM using accelerate.

    device_map="auto" automatically places layers across
    available GPUs, CPU, and even disk offload.
    """
    config = AutoConfig.from_pretrained(model_name)

    # Create the model with empty (meta) tensors -- uses zero memory
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(config)

    # Load weights progressively and dispatch to devices
    model = load_checkpoint_and_dispatch(
        model,
        model_name,
        device_map=device_map,
        no_split_module_classes=["LlamaDecoderLayer"],
        dtype=torch.float16,
    )
    return model
```
The init_empty_weights context manager creates parameter tensors on the meta device, which allocates no memory. The model’s architecture is defined but no weights exist. Then load_checkpoint_and_dispatch loads weights from disk directly to their target device, one at a time, without ever holding the full model in CPU memory.
Real Loading Time Benchmarks
Test Setup
All benchmarks were collected on a system with:
- 2x AMD EPYC 7763 (128 cores total)
- 512 GB DDR4 RAM
- 8x NVIDIA A100 80GB SXM4
- 2x Samsung PM9A3 3.84TB NVMe SSDs (RAID 0, ~12 GB/s sequential read)
Model Loading Time: Format and Strategy Comparison
| Model | Format | Strategy | Time to First Token | Peak CPU RAM |
|---|---|---|---|---|
| LLaMA-7B (1 GPU) | pickle (.bin) | torch.load | 12.4s | 28 GB |
| LLaMA-7B (1 GPU) | safetensors | direct read | 4.8s | 14 GB |
| LLaMA-7B (1 GPU) | safetensors | mmap | 3.5s | 2.1 GB |
| LLaMA-70B (8 GPU) | pickle (.bin) | torch.load + scatter | 185s | 280 GB |
| LLaMA-70B (8 GPU) | safetensors | parallel shard load | 12.4s | 3.2 GB |
| LLaMA-70B (8 GPU) | safetensors | progressive + shard | 14.8s | 2.8 GB |
| Mixtral-8x7B (2 GPU) | safetensors | parallel shard load | 8.8s | 2.9 GB |
| Falcon-180B (8 GPU) | safetensors | parallel shard load | 28s | 4.1 GB |
Breakdown: Where Time Goes
For the LLaMA-70B on 8 GPUs with parallel shard loading (12.4s total):
Loading Time Breakdown: LLaMA-70B on 8x A100
| Phase | Time | Percentage | Bottleneck |
|---|---|---|---|
| Parse safetensors header | 0.02s | 0.2% | CPU |
| Establish mmap | 0.01s | 0.1% | OS kernel |
| Read tensor data from disk | 8.1s | 65.3% | NVMe bandwidth |
| CPU-to-GPU transfer (PCIe) | 3.8s | 30.6% | PCIe bandwidth |
| Model initialization | 0.5s | 4.0% | CPU |
The two dominant costs are disk I/O and PCIe transfer. Disk read speed is determined by your storage subsystem. PCIe transfer speed is determined by your GPU interconnect (PCIe Gen4 x16 = ~25 GB/s per GPU, but shared with other GPUs on the same root complex).
The fastest possible loading comes from: (1) safetensors format, (2) NVMe storage with high sequential read bandwidth, (3) parallel shard loading so all GPUs read simultaneously, and (4) direct disk-to-GPU transfer using GPUDirect Storage (GDS) where available, which bypasses CPU entirely.
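A back-of-envelope lower bound makes the bottlenecks concrete. This toy model assumes all ranks share the disk's bandwidth while PCIe links run in parallel, and ignores overlap between the two phases — the numbers plugged in are the test setup's, and the decomposition is illustrative, not a prediction:

```python
def min_load_time_s(model_gb, disk_gbps, pcie_gbps_per_gpu, n_gpus):
    """Rough lower bound on multi-GPU load time.

    Assumes the disk is shared by all ranks, PCIe transfers run in
    parallel per GPU, and the two phases do not overlap.
    """
    shard_gb = model_gb / n_gpus
    disk_time = model_gb / disk_gbps          # whole model crosses the disk once
    pcie_time = shard_gb / pcie_gbps_per_gpu  # each GPU moves only its shard
    return disk_time + pcie_time

# 140 GB over 8 GPUs, ~12 GB/s NVMe RAID, ~25 GB/s PCIe Gen4 x16 per GPU
print(round(min_load_time_s(140, 12, 25, 8), 1))  # 12.4
```

Even in this idealized model, disk I/O dominates, which is why storage bandwidth is the first thing to upgrade.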
GPUDirect Storage: Bypassing the CPU
NVIDIA GPUDirect Storage (GDS) enables direct data paths from NVMe SSDs to GPU memory, bypassing the CPU and system memory entirely. This eliminates the CPU bounce buffer and can increase loading throughput significantly.
```python
# GDS is typically used through the cuFile API.
# Here is the conceptual flow:

def load_with_gds(filepath, gpu_buffer, offset, size):
    """
    Pseudocode for GPUDirect Storage loading.
    Requires: NVIDIA MOFED driver, a supported NVMe controller,
    and the GDS-enabled cuFile library.
    """
    # Traditional path: Disk -> CPU RAM -> GPU VRAM
    # GDS path:         Disk -> GPU VRAM (direct)
    #
    # import cufile  # NVIDIA cuFile Python bindings
    # cf = cufile.open(filepath)
    # cf.read(gpu_buffer, size, file_offset=offset, device_offset=0)
    # cf.close()
    pass
```
In practice, GDS provides 20-40% faster loading for large models on systems with compatible hardware. It is most impactful when loading many large tensors, as it eliminates the CPU-to-GPU copy step entirely.
GPUDirect Storage vs Traditional Loading
| Model Size | Traditional (Disk-CPU-GPU) | GDS (Disk-GPU Direct) | Speedup |
|---|---|---|---|
| 7 GB | 3.5s | 2.4s | 1.46x |
| 26 GB | 12.1s | 8.2s | 1.48x |
| 70 GB | 28s | 18s | 1.56x |
| 140 GB | 52s | 32s | 1.63x |
Multi-File Sharded Models
The Index File Pattern
Large models on HuggingFace are typically split into multiple safetensors files (each under 5 GB for Git LFS compatibility). An index file maps tensor names to their shard files:
```json
{
  "metadata": {
    "total_size": 140000000000
  },
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00030.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00030.safetensors",
    "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00030.safetensors",
    "model.layers.79.mlp.down_proj.weight": "model-00030-of-00030.safetensors",
    "model.norm.weight": "model-00030-of-00030.safetensors",
    "lm_head.weight": "model-00030-of-00030.safetensors"
  }
}
```
This enables:
- Loading specific layers without reading the entire model
- Parallel loading of different shard files
- Efficient tensor-parallel loading (each GPU opens only the shards containing its tensors)
```python
import json
from collections import defaultdict

def plan_parallel_loading(index_path, tp_size):
    """
    Given a model index file and tensor parallelism degree,
    plan which shard files each GPU rank needs to open.
    """
    with open(index_path) as f:
        index = json.load(f)
    weight_map = index['weight_map']

    # Group tensors by shard file
    shard_to_tensors = defaultdict(list)
    for tensor_name, shard_file in weight_map.items():
        shard_to_tensors[shard_file].append(tensor_name)

    # For each rank, determine which shards it needs
    rank_shards = defaultdict(set)
    for rank in range(tp_size):
        for shard_file, tensors in shard_to_tensors.items():
            for tensor_name in tensors:
                # Check if this rank needs any tensor from this shard
                if tensor_needed_by_rank(tensor_name, rank, tp_size):
                    rank_shards[rank].add(shard_file)

    # Print the loading plan
    for rank in range(tp_size):
        shards = rank_shards[rank]
        print(f"Rank {rank}: needs {len(shards)} shard files")
    return rank_shards

def tensor_needed_by_rank(tensor_name, rank, tp_size):
    """Every rank needs every tensor (each reads its own slice)."""
    # In practice, non-sharded tensors (layer norms, etc.) are loaded
    # whole by all ranks; sharded tensors are also opened by all ranks,
    # but each rank slices out only its portion.
    return True
```
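The index also makes partial loading trivial: to materialize a single layer, read the weight map and open only the shards it references. A stdlib-only sketch with a toy index (the shard file names are illustrative):

```python
# Toy weight map in the index format shown above
weight_map = {
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.1.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
    "model.layers.1.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
    "lm_head.weight": "model-00002-of-00002.safetensors",
}

# To load only layer 1, open just the shards that contain its tensors
needed = sorted({shard for name, shard in weight_map.items()
                 if name.startswith("model.layers.1.")})
print(needed)  # ['model-00002-of-00002.safetensors']
```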
When To Use Each Loading Strategy
Loading Strategy Selection Guide
| Scenario | Best Strategy | Why | Key Consideration |
|---|---|---|---|
| Single GPU, model fits in RAM | Direct safetensors read | Simplest, fastest for small models | Ensure safetensors format |
| Single GPU, model barely fits | mmap + progressive | Minimizes peak CPU RAM | Slower than direct if RAM is sufficient |
| Multi-GPU tensor parallel | Parallel shard loading | Each GPU reads independently | NVMe bandwidth is shared |
| Multi-GPU, limited CPU RAM | Progressive shard loading | Low CPU memory + parallel GPU loading | Slightly slower than full parallel |
| Serverless / cold start | mmap with warm page cache | Near-instant re-load | First load is still slow |
| Multiple model replicas | Shared mmap | Page cache shared across processes | Only works on same machine |
| Disk offload (CPU+GPU) | Accelerate device_map | Places layers across CPU and GPU | CPU layers are slow for inference |
- Always use safetensors format. The safety and performance benefits are significant.
- For single-GPU loading, direct read is usually fastest if RAM is sufficient.
- For multi-GPU loading, parallel shard-based loading is essential. Each GPU should read only its own data.
- Use mmap when CPU RAM is limited, when sharing models across processes, or when you want fast warm restarts.
- Storage bandwidth is the primary bottleneck. NVMe SSDs are 5-20x faster than SATA SSDs for model loading.
- The OS page cache is your friend. After the first load, subsequent loads from the same file are dramatically faster.
Conclusion
Model loading has evolved from a simple torch.load() call to a sophisticated pipeline involving purpose-built file formats, OS-level memory management, and multi-device coordination. The key developments that have made modern large model loading practical are:
Safetensors solved the safety and performance problems of pickle-based formats. Its simple, flat layout enables random access, memory mapping, and parallel loading — none of which are possible with pickle.
Memory mapping enables loading models larger than available RAM by letting the OS manage which pages are physically present. It is most valuable for memory-constrained systems and shared-model serving, but adds overhead when you are loading everything anyway.
Tensor-parallel shard loading is the single biggest improvement for multi-GPU setups. Loading 140 GB across 8 GPUs takes 12 seconds with parallel sharding vs. 3+ minutes with naive sequential loading.
Progressive loading minimizes peak CPU memory at the cost of slightly longer load times, making it possible to load massive models on machines with limited RAM.
The practical takeaway: use safetensors, use NVMe storage, load shards in parallel across GPUs, and let the OS page cache handle the rest. Model loading is a solved problem — but only if you use the right tools.