Cold start latency for Llama 70B can exceed 90 seconds if you load all 140 GB into CPU RAM before sharding to GPUs. That is 90 seconds where your serving cluster is down, requests are queuing, and users are timing out. vLLM v1 cuts this to under 30 seconds by memory-mapping safetensors files and loading each GPU’s shard directly from disk without ever materializing the full model in CPU memory. The technique works because safetensors stores every tensor at a known byte offset: each rank can seek straight to the bytes it needs and stream them to GPU HBM without ever buffering the whole model in host RAM.
The loading pipeline has three stages: weight discovery (which tensors go to which GPU), weight deserialization (reading bytes into tensors), and weight distribution (placing shards on the correct GPU). Each stage has optimization opportunities. This post covers the complete pipeline from disk bytes to GPU-resident sharded weights, including the safetensors format, memory-mapped loading, sharded loading for tensor parallelism, and cold start optimization.
The Loading Problem
Weight File Formats
LLM weights are stored in several formats. The choice of format significantly impacts loading speed.
from dataclasses import dataclass
import os
from pathlib import Path
import struct
import json
import time
@dataclass
class WeightFileInfo:
format_name: str
file_extensions: list
supports_mmap: bool
supports_lazy_load: bool
supports_sharding: bool
header_format: str
typical_overhead: str
WEIGHT_FORMATS = [
WeightFileInfo(
format_name="safetensors",
file_extensions=[".safetensors"],
supports_mmap=True,
supports_lazy_load=True,
supports_sharding=True,
header_format="JSON header at file start, then raw tensors. "
"Header contains tensor names, dtypes, shapes, "
"and byte offsets.",
typical_overhead="8-byte header size + JSON header (~10 KB). "
"No Python pickle. No serialization overhead.",
),
WeightFileInfo(
format_name="PyTorch checkpoint (.bin)",
file_extensions=[".bin"],
supports_mmap=False,
supports_lazy_load=False,
supports_sharding=False,
header_format="Python pickle + ZIP archive. "
"Entire file must be deserialized.",
typical_overhead="Pickle deserialization: 5-15% of load time. "
"Must load all tensors even if only some needed.",
),
WeightFileInfo(
format_name="GGUF (llama.cpp)",
file_extensions=[".gguf"],
supports_mmap=True,
supports_lazy_load=True,
supports_sharding=False,
header_format="Binary header with tensor metadata. "
"Supports quantized formats natively.",
typical_overhead="Minimal. Designed for mmap loading. "
"Single-file format.",
),
]
Model Loading Time Comparison (Llama 70B, 4xH100, NVMe SSD)
| Method | CPU Memory | Load Time | GPU Transfer | Total | Notes |
|---|---|---|---|---|---|
| PyTorch .bin (naive) | 140 GB | 45s | 30s | 75s | Must hold full model in CPU RAM |
| safetensors (mmap) | ~0 GB | 0s (lazy) | 35s | 35s | Zero-copy, demand paging |
| safetensors (sharded mmap) | ~0 GB | 0s (lazy) | 20s | 20s | Each GPU loads only its shard |
| safetensors (mmap + pinned) | 35 GB | 8s (prefetch) | 12s | 20s | Pinned CPU memory for faster PCIe |
| Progressive loading | ~5 GB | 0s | 30s (overlap) | 18s* | Serve early layers while loading rest |

*Time until the first request can be served; loading of the remaining layers continues in the background.
safetensors Format
Header Structure
The safetensors format is designed for fast, safe loading. The file starts with an 8-byte little-endian integer specifying the header size, followed by the JSON header, followed by raw tensor data.
class SafetensorsReader:
"""
Read safetensors files with zero-copy memory mapping.
File layout:
[8 bytes: header_size (u64 LE)]
[header_size bytes: JSON header]
[remaining bytes: raw tensor data]
JSON header format:
{
"__metadata__": {"format": "pt", ...},
"model.layers.0.self_attn.q_proj.weight": {
"dtype": "F16",
"shape": [8192, 8192],
"data_offsets": [0, 134217728]
},
...
}
"""
def __init__(self, file_path):
self.file_path = Path(file_path)
self.file_size = os.path.getsize(file_path)
self.header = None
self.tensor_info = {}
self.data_offset = 0
def read_header(self):
"""
Read and parse the header without loading any tensor data.
This is O(header_size), typically a few KB.
"""
with open(self.file_path, "rb") as f:
# Read header size (8 bytes, u64 little-endian)
header_size_bytes = f.read(8)
header_size = struct.unpack("<Q", header_size_bytes)[0]
# Read header JSON
header_bytes = f.read(header_size)
self.header = json.loads(header_bytes.decode("utf-8"))
# Data starts after header
self.data_offset = 8 + header_size
# Parse tensor info
for name, info in self.header.items():
if name.startswith("__"):
continue # Skip metadata
self.tensor_info[name] = {
"dtype": info["dtype"],
"shape": info["shape"],
"start": info["data_offsets"][0] + self.data_offset,
"end": info["data_offsets"][1] + self.data_offset,
"size_bytes": (info["data_offsets"][1]
- info["data_offsets"][0]),
}
return self.tensor_info
def mmap_tensor(self, tensor_name):
"""
Memory-map a single tensor without copying to RAM.
The OS maps the file region into virtual address space.
Physical pages are loaded on demand when accessed.
"""
import mmap
import numpy as np
info = self.tensor_info[tensor_name]
        dtype_map = {
            "F16": np.float16,
            # numpy has no bfloat16; raw bytes are reinterpreted as
            # float16 here, so BF16 values need a real conversion later
            "BF16": np.float16,
            "F32": np.float32,
            "I32": np.int32,
            "I64": np.int64,
            "U8": np.uint8,
        }
        np_dtype = dtype_map.get(info["dtype"], np.float16)
        with open(self.file_path, "rb") as f:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # Create a numpy view over the mapped region. Slicing the mmap
        # object (mm[a:b]) would copy bytes into RAM; frombuffer with an
        # offset stays zero-copy, and the array keeps mm alive.
        n_items = info["size_bytes"] // np.dtype(np_dtype).itemsize
        data = np.frombuffer(
            mm, dtype=np_dtype, count=n_items, offset=info["start"],
        ).reshape(info["shape"])
        return data
def get_tensor_names(self):
"""List all tensor names in the file."""
return list(self.tensor_info.keys())
def get_total_size(self):
"""Total size of all tensors in bytes."""
return sum(
info["size_bytes"]
for info in self.tensor_info.values()
)
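A minimal usage sketch of the reader above; the filename is a placeholder for any local safetensors shard:

```python
# Hypothetical path; any .safetensors file works
reader = SafetensorsReader("model-00001-of-00015.safetensors")
reader.read_header()

print(f"{len(reader.get_tensor_names())} tensors, "
      f"{reader.get_total_size() / 1e9:.1f} GB of tensor data")

# Map one tensor; only the file pages actually touched get read from disk
name = reader.get_tensor_names()[0]
weights = reader.mmap_tensor(name)
print(name, weights.shape, weights.dtype)
```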
Sharded safetensors
Large models are split across multiple safetensors files. An index file (model.safetensors.index.json) maps tensor names to their files.
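For reference, an abridged index file looks like this (file names follow the Hugging Face sharding convention; the size and tensor names are illustrative):

```json
{
  "metadata": { "total_size": 137953316864 },
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00015.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00015.safetensors",
    "model.layers.79.mlp.down_proj.weight": "model-00015-of-00015.safetensors",
    "lm_head.weight": "model-00015-of-00015.safetensors"
  }
}
```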
class ShardedSafetensorsLoader:
"""
Load a model from sharded safetensors files.
Files:
model.safetensors.index.json (maps tensors to shard files)
model-00001-of-00015.safetensors
model-00002-of-00015.safetensors
...
model-00015-of-00015.safetensors
"""
def __init__(self, model_dir):
self.model_dir = Path(model_dir)
self.index = None
self.shard_readers = {}
self.tensor_to_shard = {}
def load_index(self):
"""Load the shard index file."""
index_path = self.model_dir / "model.safetensors.index.json"
if not index_path.exists():
# Single file model
single_file = list(
self.model_dir.glob("*.safetensors")
)
if single_file:
reader = SafetensorsReader(single_file[0])
reader.read_header()
self.shard_readers["single"] = reader
for name in reader.get_tensor_names():
self.tensor_to_shard[name] = "single"
return
with open(index_path) as f:
self.index = json.load(f)
# Build tensor -> shard mapping
weight_map = self.index.get("weight_map", {})
for tensor_name, shard_file in weight_map.items():
self.tensor_to_shard[tensor_name] = shard_file
# Initialize readers for each shard
shard_files = set(weight_map.values())
for shard_file in shard_files:
shard_path = self.model_dir / shard_file
reader = SafetensorsReader(shard_path)
reader.read_header()
self.shard_readers[shard_file] = reader
def get_tensor(self, tensor_name):
"""Get a tensor by name, loading from the correct shard."""
shard_file = self.tensor_to_shard.get(tensor_name)
if shard_file is None:
raise KeyError(f"Tensor not found: {tensor_name}")
reader = self.shard_readers[shard_file]
return reader.mmap_tensor(tensor_name)
def get_shard_for_layer(self, layer_idx):
"""
Get all tensors for a specific transformer layer.
Returns a dict of tensor_name -> mmap'd tensor.
"""
prefix = f"model.layers.{layer_idx}."
layer_tensors = {}
for tensor_name in self.tensor_to_shard:
if tensor_name.startswith(prefix):
layer_tensors[tensor_name] = self.get_tensor(
tensor_name
)
return layer_tensors
def get_loading_plan(self):
"""
Create an optimized loading plan that minimizes
disk seeks by reading shards sequentially.
"""
plan = []
for shard_file in sorted(self.shard_readers.keys()):
reader = self.shard_readers[shard_file]
tensors = []
for tensor_name, info in reader.tensor_info.items():
tensors.append({
"name": tensor_name,
"shard": shard_file,
"offset": info["start"],
"size": info["size_bytes"],
})
# Sort by offset for sequential reads
tensors.sort(key=lambda t: t["offset"])
plan.extend(tensors)
return plan
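A sketch of how the loader and its plan might be driven; the model directory path is hypothetical:

```python
loader = ShardedSafetensorsLoader("/models/llama-70b")  # hypothetical path
loader.load_index()

plan = loader.get_loading_plan()
print(f"{len(plan)} tensors across {len(loader.shard_readers)} shard files")

# Entries are grouped per shard and sorted by byte offset, so walking the
# plan in order produces mostly sequential reads within each file.
for entry in plan[:3]:
    print(entry["shard"], entry["offset"], entry["size"], entry["name"])
```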
Weight Distribution for Tensor Parallelism
Sharding Weights Across GPUs
With tensor parallelism, each GPU holds a shard of each weight matrix. For a linear layer with weight of shape (out_features, in_features), column parallelism shards along out_features (dim 0, so each GPU gets out_features / tp_degree rows) and row parallelism shards along in_features (dim 1).
import torch
import numpy as np
class WeightDistributor:
"""
Distribute model weights across tensor parallelism ranks.
Each TP rank receives only its shard of each weight.
The distributor knows the sharding pattern for each
tensor based on the model architecture.
"""
def __init__(self, tp_degree, rank):
self.tp_degree = tp_degree
self.rank = rank
def shard_weight(self, tensor_name, weight):
"""
Shard a weight tensor for this TP rank.
Returns the shard that belongs to this rank.
"""
shard_spec = self._get_shard_spec(tensor_name)
if shard_spec is None:
# Not sharded (e.g., LayerNorm, embeddings)
return weight
dim = shard_spec["dim"]
total_size = weight.shape[dim]
shard_size = total_size // self.tp_degree
start = self.rank * shard_size
end = start + shard_size
if dim == 0:
return weight[start:end]
elif dim == 1:
return weight[:, start:end]
else:
raise ValueError(f"Unsupported shard dim: {dim}")
def _get_shard_spec(self, tensor_name):
"""
Determine how a tensor should be sharded based on its name.
Transformer sharding patterns:
- q_proj, k_proj, v_proj: column parallel (dim=0)
- o_proj: row parallel (dim=1)
- gate_proj, up_proj: column parallel (dim=0)
- down_proj: row parallel (dim=1)
        - embed_tokens: vocab parallel (dim=0, shards the vocabulary)
- lm_head: column parallel (dim=0)
- layernorm: not sharded (replicated)
"""
if any(s in tensor_name for s in
["q_proj.weight", "k_proj.weight", "v_proj.weight",
"gate_proj.weight", "up_proj.weight"]):
return {"dim": 0, "type": "column_parallel"}
        if any(s in tensor_name for s in
               ["o_proj.weight", "down_proj.weight"]):
            return {"dim": 1, "type": "row_parallel"}
        if "embed_tokens" in tensor_name:
            # Vocab-parallel embedding: shard the vocabulary dimension
            return {"dim": 0, "type": "vocab_parallel"}
if "lm_head" in tensor_name:
return {"dim": 0, "type": "column_parallel"}
# Not sharded: layer_norm, rotary embeddings, etc.
return None
def compute_shard_sizes(self, model_config):
"""
Compute the memory required per GPU for sharded weights.
"""
hidden = model_config.get("hidden_size", 8192)
intermediate = model_config.get("intermediate_size", 28672)
n_layers = model_config.get("num_hidden_layers", 80)
vocab_size = model_config.get("vocab_size", 32000)
n_heads = model_config.get("num_attention_heads", 64)
n_kv_heads = model_config.get("num_key_value_heads", 8)
head_dim = hidden // n_heads
bytes_per_param = 2 # FP16
# Per-layer sharded sizes
# QKV projection: (n_heads * head_dim + 2 * n_kv_heads * head_dim, hidden) / tp
qkv_per_gpu = (
(n_heads * head_dim + 2 * n_kv_heads * head_dim)
// self.tp_degree * hidden * bytes_per_param
)
# O projection: (hidden, n_heads * head_dim) / tp
o_per_gpu = (
hidden * (n_heads * head_dim)
// self.tp_degree * bytes_per_param
)
# MLP: gate_proj and up_proj column-sharded,
# down_proj row-sharded
mlp_per_gpu = (
2 * (intermediate // self.tp_degree) * hidden
+ (intermediate // self.tp_degree) * hidden
) * bytes_per_param
# LayerNorm: replicated
ln_per_gpu = 2 * hidden * bytes_per_param * 2 # 2 LN per layer
per_layer = qkv_per_gpu + o_per_gpu + mlp_per_gpu + ln_per_gpu
# Embedding and LM head
embed_per_gpu = vocab_size * hidden * bytes_per_param // self.tp_degree
lm_head_per_gpu = vocab_size * hidden * bytes_per_param // self.tp_degree
total_per_gpu = (
per_layer * n_layers + embed_per_gpu + lm_head_per_gpu
)
return {
"per_layer_gb": per_layer / 1e9,
"embed_gb": embed_per_gpu / 1e9,
"lm_head_gb": lm_head_per_gpu / 1e9,
"total_per_gpu_gb": total_per_gpu / 1e9,
"total_model_gb": total_per_gpu * self.tp_degree / 1e9,
}
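Plugging a Llama-70B-like configuration (the defaults above) into this calculator for 4-way TP lands close to the 35 GB per GPU quoted in the loading table; exact figures depend on the real model config:

```python
distributor = WeightDistributor(tp_degree=4, rank=0)
sizes = distributor.compute_shard_sizes({
    "hidden_size": 8192,
    "intermediate_size": 28672,
    "num_hidden_layers": 80,
    "vocab_size": 32000,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
})
print(sizes)
# Roughly: ~0.43 GB per layer, ~34-35 GB per GPU, ~138 GB total model (FP16)
```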
Direct-to-GPU Loading
Bypassing CPU Memory
The optimal loading path streams weight data from disk to GPU memory without ever holding the full model in CPU RAM. It uses a small CUDA pinned-memory staging buffer and overlaps disk reads with host-to-GPU transfers.
class DirectGPULoader:
"""
Load model weights directly to GPU memory with
minimal CPU memory usage.
Pipeline:
1. mmap the safetensors file (no CPU memory)
2. Allocate a small pinned CPU buffer (e.g., 256 MB)
3. For each tensor shard:
a. Copy from mmap to pinned buffer
b. Async copy from pinned buffer to GPU
c. While GPU copy runs, start next mmap read
"""
def __init__(self, tp_degree, rank, buffer_size_mb=256):
self.tp_degree = tp_degree
self.rank = rank
self.buffer_size = buffer_size_mb * 1024 * 1024
self.distributor = WeightDistributor(tp_degree, rank)
def load_model(self, model_dir, device):
"""
Load a sharded model to the specified GPU device.
Returns a dict of tensor_name -> GPU tensor.
"""
loader = ShardedSafetensorsLoader(model_dir)
loader.load_index()
# Get loading plan (sequential disk access)
plan = loader.get_loading_plan()
# Allocate pinned CPU buffer
pinned_buffer = torch.empty(
self.buffer_size // 2, # FP16 elements
dtype=torch.float16,
pin_memory=True,
)
# Create CUDA stream for async transfer
transfer_stream = torch.cuda.Stream(device=device)
gpu_tensors = {}
load_stats = {
"tensors_loaded": 0,
"bytes_transferred": 0,
"skipped_tensors": 0,
}
for tensor_info in plan:
tensor_name = tensor_info["name"]
# Load tensor via mmap (no actual RAM used until access)
cpu_tensor = loader.get_tensor(tensor_name)
# Shard for this TP rank
shard = self.distributor.shard_weight(
tensor_name, cpu_tensor
)
if shard is None:
load_stats["skipped_tensors"] += 1
continue
            # Convert numpy to torch and stage the shard in pinned memory
            # so the host-to-device copy can actually run asynchronously
            # (non_blocking only helps when the source is page-locked).
            if isinstance(shard, np.ndarray):
                shard_tensor = torch.from_numpy(shard.copy())
            else:
                shard_tensor = shard
            if (shard_tensor.dtype == torch.float16
                    and shard_tensor.numel() <= pinned_buffer.numel()):
                staging = pinned_buffer[:shard_tensor.numel()].view(
                    shard_tensor.shape
                )
                staging.copy_(shard_tensor)
            else:
                # Shard does not fit the staging buffer: pin it directly
                staging = shard_tensor.pin_memory()
            # Async transfer to GPU on the dedicated stream
            with torch.cuda.stream(transfer_stream):
                gpu_tensor = staging.to(
                    device=device, non_blocking=True
                )
            # The single staging buffer is reused next iteration, so wait
            # for this transfer first. A production loader would
            # double-buffer to overlap disk reads with PCIe transfers.
            transfer_stream.synchronize()
            gpu_tensors[tensor_name] = gpu_tensor
            load_stats["tensors_loaded"] += 1
            load_stats["bytes_transferred"] += (
                shard_tensor.numel() * shard_tensor.element_size()
            )
        # Make sure all transfers are complete before returning
        transfer_stream.synchronize()
load_stats["bytes_transferred_gb"] = (
load_stats["bytes_transferred"] / 1e9
)
return gpu_tensors, load_stats
The key optimization: each GPU reads only its shard from disk. With 4-way TP and safetensors mmap, GPU 0 reads roughly 35 GB out of 140 GB of weight files (only the pages backing its rows or columns), and GPU 1 reads a different 35 GB. The OS page cache may still pull in more, but the data path from disk to each GPU is 35 GB, not 140 GB.
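Readahead can also be steered explicitly. On Linux with Python 3.8+, mmap.madvise lets the loader ask the kernel to prefetch exactly the byte range a shard needs; a standalone sketch (the MADV_* constants are platform-specific, and a production loader would fold this into mmap_tensor):

```python
import mmap

def prefetch_range(file_path, start, end):
    """Hint the kernel to prefetch the pages backing one tensor's bytes."""
    with open(file_path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    page = mmap.PAGESIZE
    aligned_start = (start // page) * page  # madvise offsets must be page-aligned
    if hasattr(mmap, "MADV_SEQUENTIAL"):
        mm.madvise(mmap.MADV_SEQUENTIAL)
    if hasattr(mmap, "MADV_WILLNEED"):
        mm.madvise(mmap.MADV_WILLNEED, aligned_start, end - aligned_start)
    return mm
```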
Progressive Loading
Serve While Loading
Progressive loading starts serving requests before the full model is loaded. The first few layers are loaded and can begin processing prefill requests while later layers continue loading in the background.
class ProgressiveModelLoader:
"""
Load model layers progressively, enabling early serving.
Strategy:
1. Load embedding layer first
2. Load transformer layers 0, 1, 2, ...
3. After N layers loaded (e.g., first 25%), start accepting
requests for prefill (partial processing)
4. Continue loading remaining layers
5. Once all layers loaded, start full generation
During partial loading, prefill requests can start
processing through the loaded layers, building KV cache
entries. When remaining layers finish loading, decode
can begin immediately.
"""
def __init__(self, model_dir, tp_degree, rank, device):
self.model_dir = model_dir
self.tp_degree = tp_degree
self.rank = rank
self.device = device
self.loader = ShardedSafetensorsLoader(model_dir)
self.distributor = WeightDistributor(tp_degree, rank)
        self.loaded_layers = set()
        self.total_layers = 0
        self.is_ready = False
        self.loading_thread = None
        # GPU-resident shards keyed by tensor name (filled as layers load)
        self.gpu_tensors = {}
def start_loading(self, on_layer_loaded=None):
"""
Start loading model in the background.
Calls on_layer_loaded(layer_idx) after each layer completes.
"""
self.loader.load_index()
# Determine total layers from tensor names
layer_indices = set()
for name in self.loader.tensor_to_shard:
parts = name.split(".")
for i, part in enumerate(parts):
if part == "layers" and i + 1 < len(parts):
try:
layer_indices.add(int(parts[i + 1]))
except ValueError:
pass
self.total_layers = max(layer_indices) + 1 if layer_indices else 0
import threading
def _load():
# Phase 1: Load embedding
self._load_component("embed")
if on_layer_loaded:
on_layer_loaded(-1) # -1 = embedding
# Phase 2: Load layers sequentially
for layer_idx in range(self.total_layers):
self._load_layer(layer_idx)
self.loaded_layers.add(layer_idx)
if on_layer_loaded:
on_layer_loaded(layer_idx)
# Phase 3: Load LM head
self._load_component("lm_head")
self.is_ready = True
if on_layer_loaded:
on_layer_loaded(self.total_layers)
self.loading_thread = threading.Thread(target=_load)
self.loading_thread.start()
    def _load_layer(self, layer_idx):
        """Load a single transformer layer's shards to GPU."""
        layer_tensors = self.loader.get_shard_for_layer(layer_idx)
        for tensor_name, cpu_tensor in layer_tensors.items():
            shard = self.distributor.shard_weight(
                tensor_name, cpu_tensor
            )
            if shard is not None:
                if isinstance(shard, np.ndarray):
                    shard = torch.from_numpy(shard.copy())
                # .to() returns a new tensor; keep a reference or the
                # GPU copy is immediately discarded
                self.gpu_tensors[tensor_name] = shard.to(self.device)
    def _load_component(self, component):
        """Load a non-layer component (embedding, lm_head)."""
        for tensor_name in self.loader.tensor_to_shard:
            if component in tensor_name:
                cpu_tensor = self.loader.get_tensor(tensor_name)
                shard = self.distributor.shard_weight(
                    tensor_name, cpu_tensor
                )
                if shard is not None:
                    if isinstance(shard, np.ndarray):
                        shard = torch.from_numpy(shard.copy())
                    self.gpu_tensors[tensor_name] = shard.to(self.device)
def get_loading_progress(self):
"""Get loading progress."""
return {
"loaded_layers": len(self.loaded_layers),
"total_layers": self.total_layers,
"progress_pct": (
len(self.loaded_layers) / max(self.total_layers, 1) * 100
),
"is_ready": self.is_ready,
}
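A sketch of driving the progressive loader from a serving process; the 25% threshold for opening up prefill-only traffic is an assumption, not a fixed rule:

```python
loader = ProgressiveModelLoader(
    model_dir="/models/llama-70b",  # hypothetical path
    tp_degree=4,
    rank=0,
    device="cuda:0",
)

def on_layer_loaded(layer_idx):
    progress = loader.get_loading_progress()
    print(f"layer {layer_idx}: {progress['progress_pct']:.0f}% loaded")
    if progress["progress_pct"] >= 25 and not progress["is_ready"]:
        pass  # e.g. start admitting prefill-only requests here

loader.start_loading(on_layer_loaded=on_layer_loaded)
loader.loading_thread.join()  # a real server would poll instead of blocking
```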
Progressive Loading: Time to First Request vs Full Model Load

[Chart: time to first request in seconds (0-35) for standard vs progressive loading, Llama 70B on 4xH100.]
Cold Start Optimization
Reducing Time to First Token
Cold start is the time from process launch to the first served request. For LLM serving, this includes Python startup, CUDA initialization, NCCL initialization, model loading, KV cache allocation, and the warmup forward pass.
class ColdStartOptimizer:
"""
Optimize cold start time for LLM serving.
Breakdown for Llama 70B on 4xH100:
- Python + imports: 3-5s
- CUDA initialization: 1-2s
- NCCL initialization: 2-5s
- Model loading: 20-35s (dominant)
- KV cache allocation: 1-2s
- Warmup (first forward pass): 2-3s
Total: 30-50s
Optimizations:
1. Pre-compiled Python (no cold import)
2. Model weight caching in tmpfs/ramdisk
3. Parallel NCCL init + model loading
4. Progressive loading for partial serving
"""
def __init__(self):
self.timings = {}
def optimize_startup(self, model_dir, tp_degree, rank, device):
"""
Optimized startup sequence.
"""
total_start = time.time()
# Phase 1: CUDA + NCCL initialization (overlap with model prep)
import threading
nccl_ready = threading.Event()
def init_nccl():
start = time.time()
torch.cuda.set_device(device)
# Initialize NCCL communicator
# In production: torch.distributed.init_process_group(...)
torch.cuda.synchronize()
self.timings["nccl_init"] = time.time() - start
nccl_ready.set()
nccl_thread = threading.Thread(target=init_nccl)
nccl_thread.start()
# Phase 2: While NCCL initializes, start model loading
start = time.time()
loader = DirectGPULoader(tp_degree, rank)
gpu_tensors, load_stats = loader.load_model(
model_dir, device
)
self.timings["model_load"] = time.time() - start
# Wait for NCCL
nccl_ready.wait()
nccl_thread.join()
# Phase 3: KV cache allocation
start = time.time()
self._allocate_kv_cache(device)
self.timings["kv_cache_alloc"] = time.time() - start
# Phase 4: Warmup forward pass
start = time.time()
self._warmup(device)
self.timings["warmup"] = time.time() - start
self.timings["total"] = time.time() - total_start
return {
"timings": self.timings,
"gpu_tensors": len(gpu_tensors),
"bytes_loaded": load_stats["bytes_transferred_gb"],
}
def _allocate_kv_cache(self, device):
"""Allocate KV cache blocks."""
# In production: allocate block tables and cache tensors
# Placeholder: allocate a large contiguous tensor
free_memory = torch.cuda.mem_get_info(device)[0]
# Use 60% of free memory for KV cache
cache_size = int(free_memory * 0.6)
# Allocate as FP16
n_elements = cache_size // 2
kv_cache = torch.empty(
n_elements, dtype=torch.float16, device=device
)
del kv_cache # Will be properly managed by block allocator
def _warmup(self, device):
"""Run a warmup forward pass to JIT-compile kernels."""
# Small input to trigger kernel compilation
dummy_input = torch.randint(
0, 32000, (1, 128), device=device
)
# In production: model.forward(dummy_input)
torch.cuda.synchronize()
def get_optimization_report(self):
"""Report on cold start timing breakdown."""
total = self.timings.get("total", 1)
report = []
for phase, duration in sorted(
self.timings.items(), key=lambda x: x[1], reverse=True
):
if phase == "total":
continue
pct = duration / total * 100
report.append({
"phase": phase,
"duration_s": round(duration, 2),
"percentage": round(pct, 1),
})
report.append({
"phase": "total",
"duration_s": round(total, 2),
"percentage": 100.0,
})
return report
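Driving the optimizer end to end; the report sorts phases by duration, which makes it easy to confirm that model loading dominates:

```python
optimizer = ColdStartOptimizer()
result = optimizer.optimize_startup(
    model_dir="/models/llama-70b",  # hypothetical path
    tp_degree=4,
    rank=0,
    device="cuda:0",
)

for entry in optimizer.get_optimization_report():
    print(f"{entry['phase']:>15}: {entry['duration_s']:6.2f}s "
          f"({entry['percentage']}%)")
```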
Cold Start Timing Breakdown (Llama 70B, 4xH100, NVMe SSD)
| Phase | Standard (s) | Optimized (s) | Notes |
|---|---|---|---|
| Python + imports | 4.0 | 2.0 | Pre-import cached |
| CUDA init | 1.5 | 1.5 | Cannot optimize |
| NCCL init | 3.0 | 0.0* | Overlapped with model load |
| Model loading | 35.0 | 20.0 | Sharded mmap + pinned memory |
| KV cache alloc | 1.5 | 1.0 | Pre-computed block layout |
| Warmup | 2.5 | 1.5 | Minimal warmup tokens |
| Total | 47.5 | 26.0 | 45% reduction |
Key Takeaways
Model loading is the dominant component of cold start time. The difference between naive loading (75+ seconds for Llama 70B) and optimized loading (20-26 seconds) is the difference between an intolerable startup delay and a tolerable one.
Core optimizations:
- safetensors mmap eliminates CPU memory overhead: Memory-mapped loading reads tensor data on demand without buffering the entire file in CPU RAM. This reduces CPU memory from 140 GB to near zero.
- Sharded loading reduces per-GPU I/O: Each TP rank reads only its weight shard from disk. For 4-way TP, each GPU reads 35 GB instead of 140 GB. Combined with mmap, only the relevant file pages are loaded.
- Overlap NCCL init with model loading: NCCL initialization takes 2-5 seconds and is independent of model loading. Running them in parallel saves the full NCCL init time.
- Progressive loading enables early serving: Start accepting requests after loading the first few layers. Prefill processing can begin immediately; full generation starts when all layers are loaded. This reduces time-to-first-request by 30-50%.
- Pinned memory accelerates GPU transfers: CPU-to-GPU data transfer over PCIe is 2-3x faster with pinned (page-locked) memory (a quick benchmark sketch follows below). The tradeoff: pinned memory is a scarce resource (limited by OS configuration, typically 50% of system RAM).
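A quick way to check the pinned-vs-pageable gap on your own hardware; this times a ~1 GB host-to-device copy each way, and the ratio depends on PCIe generation and platform:

```python
import time
import torch

def h2d_seconds(src, device="cuda:0", iters=5):
    """Average time for one host-to-device copy of src."""
    src.to(device)                      # warm up: CUDA context, first copy
    torch.cuda.synchronize(device)
    start = time.perf_counter()
    for _ in range(iters):
        src.to(device, non_blocking=True)
        torch.cuda.synchronize(device)
    return (time.perf_counter() - start) / iters

pageable = torch.empty(512 * 1024 * 1024, dtype=torch.float16)  # ~1 GB
pinned = pageable.pin_memory()
print(f"pageable: {1.0 / h2d_seconds(pageable):.1f} GB/s")
print(f"pinned:   {1.0 / h2d_seconds(pinned):.1f} GB/s")
```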
The fundamental limit: a 140 GB model on a single NVMe SSD (7 GB/s read) takes at least 140 / 7 = 20 seconds of raw I/O, no matter how the work is split. With 4-way TP each GPU needs only its 35 GB shard (5 seconds at 7 GB/s), but the four shards together still total 140 GB off the same drive, so reaching that per-GPU floor requires striping weights across multiple drives or serving them from a warm page cache; under those conditions the floor, including PCIe transfer and startup overhead, is roughly 8-10 seconds. Current optimized implementations achieve 20-26 seconds, leaving room for further improvement through better I/O overlap and prefetching.