In vLLM v0, Worker 0 was special. It ran the scheduler, prepared the full batch metadata, and broadcast it to Workers 1 through N-1 every iteration. The other workers received this metadata, ran the forward pass, participated in all-reduce operations, and sent results back to Worker 0 for sampling. This asymmetry had a concrete performance cost: scheduling on Worker 0 serialized with the forward pass because Worker 0 could not begin its portion of the model computation until scheduling was complete. The GPU on Worker 0 sat idle during scheduling while all other GPUs waited for the broadcast.
vLLM v1 eliminates this bottleneck by making all workers symmetric. The scheduler runs in a dedicated CPU process, completely decoupled from the GPU workers. Workers cache request state locally and receive only incremental updates (new token IDs, new positions, new requests, completed requests) rather than the full batch tensor every iteration. This reduces per-iteration communication from megabytes to kilobytes.
This post covers the v0 architecture and its bottleneck, the v1 symmetric design, the incremental update protocol, the NCCL all-reduce optimization strategy, and measured performance improvements.
The v0 Asymmetry Problem
v0 Architecture
In v0 with TP=4, the four workers formed an asymmetric group:
```python
# vLLM v0: Asymmetric worker architecture
import time

import torch
import torch.distributed as dist


class V0WorkerGroup:
    def __init__(self, tp_degree):
        self.workers = []
        for rank in range(tp_degree):
            if rank == 0:
                # Worker 0: scheduler + model shard + sampling
                self.workers.append(V0DriverWorker(rank))
            else:
                # Workers 1..N-1: model shard only
                self.workers.append(V0Worker(rank))


class V0DriverWorker:
    """Worker 0: runs scheduler, prepares inputs, drives execution."""

    def step(self):
        # Phase 1: Scheduling (CPU, Worker 0 only)
        # All other workers' GPUs are IDLE during this
        schedule_start = time.monotonic()
        scheduler_outputs = self.scheduler.schedule()
        # scheduler_outputs contains:
        # - token_ids: [batch_size, max_seq_len] padded tensor
        # - position_ids: [batch_size, max_seq_len]
        # - block_tables: [batch_size, max_blocks]
        # - seq_lens: [batch_size]
        # - slot_mapping: [total_tokens]
        schedule_time = time.monotonic() - schedule_start

        # Phase 2: Prepare input tensors (CPU, Worker 0)
        input_tensors = self._prepare_inputs(scheduler_outputs)
        # input_tensors is ~2-4 MB for a typical batch

        # Phase 3: Broadcast to all workers (CPU -> GPU -> NCCL broadcast)
        broadcast_start = time.monotonic()
        for tensor_name, tensor in input_tensors.items():
            dist.broadcast(tensor, src=0)
        broadcast_time = time.monotonic() - broadcast_start

        # Phase 4: Forward pass (GPU, all workers participate)
        hidden_states = self.model.forward(input_tensors)

        # Phase 5: All-reduce (NCCL, all workers)
        # Each TP shard produces partial output; all-reduce sums them
        dist.all_reduce(hidden_states)

        # Phase 6: Sampling (Worker 0 only)
        next_tokens = self.sampler.sample(hidden_states)
        return next_tokens


class V0Worker:
    """Workers 1..N-1: receive inputs, compute, all-reduce."""

    def step(self):
        # Wait for broadcast from Worker 0; tensor names and shapes follow
        # a fixed schema agreed on at startup
        input_tensors = {}
        for tensor_name, expected_shape in self.expected_tensor_shapes.items():
            tensor = torch.empty(expected_shape, device="cuda")
            dist.broadcast(tensor, src=0)
            input_tensors[tensor_name] = tensor

        # Forward pass
        hidden_states = self.model.forward(input_tensors)

        # All-reduce
        dist.all_reduce(hidden_states)
        # Worker does NOT sample -- only Worker 0 does
```
Quantifying the Bottleneck
The scheduling phase on Worker 0 takes roughly 0.5-2.5ms per iteration depending on batch size and complexity (preemption decisions, prefix cache lookups, block allocation). During this time, Workers 1-3 are completely idle.
v0 Worker 0 Scheduling Overhead (Llama 70B, TP=4, H100 NVLink)
| Batch Size | Schedule Time (ms) | Broadcast Time (ms) | Forward Pass (ms) | GPU Idle Time (Workers 1-3) | Overhead % |
|---|---|---|---|---|---|
| 32 | 0.5 | 0.3 | 28 | 0.8 ms | 2.8% |
| 128 | 1.2 | 0.8 | 32 | 2.0 ms | 5.9% |
| 256 | 1.8 | 1.2 | 35 | 3.0 ms | 7.9% |
| 512 | 2.5 | 1.8 | 38 | 4.3 ms | 10.2% |
At batch size 512, Workers 1-3 each waste 4.3ms per iteration waiting for Worker 0. Over 1000 decode iterations (generating 1000 tokens), that is 4.3 seconds of accumulated GPU idle time per worker, or roughly 12.9 GPU-seconds across the three idle workers. At a typical H100 price of about $2 per GPU-hour, that works out to roughly $0.007 per 1000-token generation just in wasted GPU time.
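As a quick back-of-envelope check (the ~$2/GPU-hour figure is an assumed list price, not something vLLM reports):

```python
# Back-of-envelope cost of the idle time at batch size 512.
# Assumes ~$2/GPU-hour for an H100; adjust for your actual pricing.
idle_ms_per_iter = 4.3   # GPU idle time per idle worker per iteration
iterations = 1000        # one 1000-token generation
idle_workers = 3         # Workers 1-3 sit idle while Worker 0 schedules

idle_gpu_seconds = idle_ms_per_iter / 1000 * iterations * idle_workers  # 12.9 s
dollars_per_gpu_second = 2.0 / 3600
print(f"{idle_gpu_seconds:.1f} GPU-seconds wasted, "
      f"~${idle_gpu_seconds * dollars_per_gpu_second:.4f} per generation")
```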
The broadcast itself is also wasteful. Every iteration, Worker 0 sends the full batch metadata — token IDs, position IDs, block tables — even though most of this data did not change from the previous iteration. In a decode step, only one new token per sequence changes, but the entire batch tensor is re-broadcast.
The v1 Symmetric Architecture
Design Principles
vLLM v1 restructures the worker group around three principles:
- All workers are identical: No driver worker. Every worker runs the same code path.
- Scheduler is a separate process: Scheduling happens on CPU in its own process, sending results via shared memory or ZMQ.
- Workers cache state: Each worker maintains a local copy of request state (token IDs, positions, block tables) and receives only incremental diffs.
```python
# vLLM v1: Symmetric worker architecture
import torch.distributed as dist


class V1WorkerGroup:
    def __init__(self, tp_degree):
        # Scheduler runs in its own process
        self.scheduler_process = SchedulerProcess()
        # All workers are identical
        self.workers = [V1Worker(rank) for rank in range(tp_degree)]
        # Shared memory region for scheduler -> worker communication
        self.update_channel = SharedMemoryChannel(
            size=1024 * 1024  # 1 MB buffer (more than enough for diffs)
        )


class V1Worker:
    """Symmetric worker: caches state, receives incremental updates."""

    def __init__(self, rank):
        self.rank = rank
        self.model_shard = None  # Loaded during initialization
        # Cached request state (mirrors the scheduler's view)
        self.cached_requests = {}  # request_id -> CachedRequestState
        self.cached_batch = None   # Pre-built batch tensors

    def step(self):
        # Phase 1: Receive incremental update (non-blocking if ready)
        update = self.update_channel.recv()
        # update is ~10 KB, not ~2 MB

        # Phase 2: Apply update to cached state
        self._apply_update(update)

        # Phase 3: Build input tensors from cached state
        input_tensors = self._build_inputs_from_cache()

        # Phase 4: Forward pass (GPU)
        hidden_states = self.model_shard.forward(input_tensors)

        # Phase 5: All-reduce (NCCL)
        dist.all_reduce(hidden_states)

        # Phase 6: Sampling (ALL workers sample -- not just Worker 0)
        next_tokens = self.sampler.sample(hidden_states)

        # Phase 7: Feed next tokens back to cached state
        self._update_tokens(next_tokens)
        return next_tokens
```
In v0, only Worker 0 ran sampling because it was the only worker with the full hidden states after all-reduce. In v1, all workers have the full hidden states after all-reduce (all-reduce produces identical results on all ranks by definition). So all workers can sample independently. Since sampling uses the same random seed (synchronized across workers), all workers produce the same next-token IDs. This eliminates the need for Worker 0 to broadcast sampled tokens back to other workers.
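A minimal sketch of what synchronized sampling can look like, assuming each rank seeds a torch.Generator from the same scheduler-assigned seed (the function and the seed plumbing are illustrative, not vLLM's actual sampler API):

```python
import torch

def sample_synchronized(logits: torch.Tensor, seed: int, temperature: float = 1.0):
    """Sample next tokens identically on every TP rank.

    Every rank holds the same `logits` after the final all-reduce, and every
    rank seeds its generator from the same scheduler-assigned `seed`, so the
    multinomial draw below returns the same token IDs on all ranks.
    """
    gen = torch.Generator(device=logits.device)
    gen.manual_seed(seed)  # same seed on every rank -> same draw
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1, generator=gen)
```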
The Scheduler Process
The scheduler runs in its own OS process, communicating with workers via shared memory:
```python
class SchedulerProcess:
    """Dedicated CPU process for scheduling decisions."""

    def __init__(self, worker_update_channels):
        self.scheduler = Scheduler()
        self.channels = worker_update_channels  # One per worker (or multicast)
        self.request_states = {}  # Complete request state

    def run_loop(self):
        while True:
            # Run scheduling (CPU-only, does not block any GPU)
            outputs = self.scheduler.schedule()

            # Compute incremental diff from previous state
            diff = self._compute_diff(outputs)

            # Send diff to all workers via shared memory
            # This is a memcpy of ~10 KB, taking < 10 us
            for channel in self.channels:
                channel.send(diff)

            # Wait for workers to signal completion of forward pass
            self._wait_for_workers()

            # Receive sampled tokens from workers
            new_tokens = self._receive_tokens()

            # Update scheduler state
            self._process_new_tokens(new_tokens)
```
The critical path change: scheduling and GPU computation now overlap. While workers execute the forward pass for iteration t, the scheduler can begin scheduling iteration t+1. The GPU never waits for scheduling.
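To make the overlap concrete, here is a toy pipelining loop (names such as schedule_and_diff, send_update, and process_new_tokens are illustrative, not vLLM's actual interfaces): the scheduler prepares the diff for step t+1 on a background thread while the workers execute step t.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipelined(scheduler, workers, num_steps):
    """Toy loop: scheduling for step t+1 overlaps with GPU execution of step t."""
    pool = ThreadPoolExecutor(max_workers=1)
    next_diff = scheduler.schedule_and_diff()          # schedule step 0
    for _ in range(num_steps):
        diff = next_diff
        for worker in workers:
            worker.send_update(diff)                   # ~KB shared-memory write
        # Start scheduling the next step on the CPU ...
        future = pool.submit(scheduler.schedule_and_diff)
        # ... while the GPUs run the current forward pass and sample.
        sampled = [worker.step() for worker in workers]
        scheduler.process_new_tokens(sampled[0])       # identical on every rank
        next_diff = future.result()
    pool.shutdown()
```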
The Incremental Update Protocol
What Changes Between Iterations
During decode, the per-iteration changes are minimal:
```python
from dataclasses import dataclass


@dataclass
class IncrementalUpdate:
    """Diff between iteration t and iteration t+1."""
    # New requests entering the batch (prefill)
    new_requests: list       # Each: {request_id, token_ids, block_table, sampling_params}
    # Completed requests leaving the batch
    finished_requests: list  # Each: request_id
    # Per-sequence updates for continuing decode
    token_updates: dict      # request_id -> new_token_id (one int per sequence)
    position_updates: dict   # request_id -> new_position (one int per sequence)
    # Block table changes (only when a new block is allocated)
    block_table_updates: dict  # request_id -> {new_block_id, slot_index}
    # Preemption events
    preempted_requests: list   # request_ids to swap out
```
Size Comparison
For a batch of 256 sequences in decode mode:
```python
# v0: Full batch broadcast every iteration
v0_broadcast_size = {
    "token_ids": 256 * 4096 * 4,     # [batch, max_seq_len] int32 = 4.19 MB
    "position_ids": 256 * 4096 * 4,  # [batch, max_seq_len] int32 = 4.19 MB
    "block_tables": 256 * 256 * 4,   # [batch, max_blocks] int32 = 262 KB
    "seq_lens": 256 * 4,             # [batch] int32 = 1 KB
    "slot_mapping": 256 * 4096 * 4,  # [total_tokens] int32 = 4.19 MB
}
# Total: ~12.8 MB per iteration

# v1: Incremental update
v1_update_size = {
    "new_requests": 0,            # No new requests in steady decode
    "finished_reqs": 0,           # No completions this iteration
    "token_updates": 256 * 8,     # 256 sequences, {req_id: int32, token: int32} = 2 KB
    "position_updates": 256 * 8,  # 256 sequences, {req_id: int32, pos: int32} = 2 KB
    "block_updates": 16 * 12,     # ~16 sequences need a new block (every 16th token) = 192 B
    "preemptions": 0,             # No preemptions
}
# Total: ~4.2 KB per iteration
```
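To sanity-check the kilobyte figure, a snippet like the one below serializes a steady-state decode diff for 256 sequences and prints its size (pickle is used purely for illustration; the actual wire format is an implementation detail):

```python
import pickle

# Steady-state decode diff for 256 sequences: one new token and one new
# position per request, no new, finished, or preempted requests.
update = {
    "new_requests": [],
    "finished_requests": [],
    "token_updates": {req_id: 42 for req_id in range(256)},
    "position_updates": {req_id: 1000 + req_id for req_id in range(256)},
    "block_table_updates": {},
    "preempted_requests": [],
}
payload = pickle.dumps(update)
print(f"serialized update: {len(payload) / 1024:.1f} KB")  # a few KB, vs ~12.8 MB in v0
```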
Per-Iteration Communication: v0 Broadcast vs v1 Incremental Update (KB)

The reduction is over 3000x for batch=256. This is not a bandwidth optimization — 12.8 MB over NVLink takes only 0.03ms. The savings come from eliminating the CPU-side tensor preparation on Worker 0 (which takes 1-2ms) and the serialization of scheduling with computation.
Applying Updates to Cached State
Each worker maintains a local CachedRequestState per active request:
```python
from dataclasses import dataclass

import torch


@dataclass
class CachedRequestState:
    """Per-request state cached on each worker."""
    request_id: int
    token_ids: list        # All tokens generated so far
    current_position: int  # Current position in the sequence
    block_table: list      # Physical block IDs for this sequence
    sampling_params: SamplingParams
    is_prefill: bool       # True during first iteration


class V1Worker:
    def _apply_update(self, update: IncrementalUpdate):
        """Apply incremental diff to cached state. O(diff_size), not O(batch_size)."""
        # Add new requests (prefill entries)
        for new_req in update.new_requests:
            self.cached_requests[new_req.request_id] = CachedRequestState(
                request_id=new_req.request_id,
                token_ids=list(new_req.token_ids),
                current_position=len(new_req.token_ids),
                block_table=list(new_req.block_table),
                sampling_params=new_req.sampling_params,
                is_prefill=True,
            )

        # Remove finished requests
        for req_id in update.finished_requests:
            del self.cached_requests[req_id]

        # Update tokens (decode: one new token per sequence)
        for req_id, new_token in update.token_updates.items():
            state = self.cached_requests[req_id]
            state.token_ids.append(new_token)
            state.current_position += 1
            state.is_prefill = False

        # Update block tables (only for sequences that got new blocks)
        for req_id, block_update in update.block_table_updates.items():
            state = self.cached_requests[req_id]
            state.block_table.append(block_update.new_block_id)

        # Handle preemptions
        for req_id in update.preempted_requests:
            state = self.cached_requests.pop(req_id)
            self._save_to_swap_space(state)

    def _build_inputs_from_cache(self):
        """Build GPU input tensors from cached state."""
        requests = list(self.cached_requests.values())

        # For decode: each sequence contributes one token
        if all(not r.is_prefill for r in requests):
            token_ids = torch.tensor(
                [r.token_ids[-1] for r in requests],
                dtype=torch.long, device="cuda",
            )
            positions = torch.tensor(
                [r.current_position - 1 for r in requests],
                dtype=torch.long, device="cuda",
            )
        else:
            # Mixed prefill + decode: variable-length packing
            token_ids, positions = self._pack_mixed_batch(requests)

        return {"token_ids": token_ids, "positions": positions}
```
All workers receive the same incremental update from the scheduler (via multicast shared memory). Since updates are applied deterministically (no random operations, no floating-point), all workers end up with identical cached state. This means all workers build identical input tensors and produce identical forward pass results (up to floating-point non-determinism in NCCL all-reduce, which is handled by using deterministic reduction algorithms). The sampling RNG is also synchronized, so all workers produce the same next-token selections.
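Since correctness hinges on every rank holding byte-identical cached state, a debugging-style check (not part of vLLM; assert_cached_state_in_sync is an illustrative helper) can hash each rank's state and compare the digests with an all-gather:

```python
import hashlib
import pickle

import torch.distributed as dist

def assert_cached_state_in_sync(cached_requests, tp_group):
    """Debug check: verify every TP rank holds identical cached request state."""
    digest = hashlib.sha256(
        pickle.dumps(sorted(cached_requests.items()))
    ).hexdigest()
    all_digests = [None] * dist.get_world_size(group=tp_group)
    dist.all_gather_object(all_digests, digest, group=tp_group)
    assert len(set(all_digests)) == 1, f"cached state diverged: {all_digests}"
```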
NCCL All-Reduce Optimization
Tensor parallelism requires an all-reduce after every attention layer and every FFN layer. For Llama 70B with 80 layers, that is 160 all-reduce operations per forward pass. The all-reduce volume per operation is

volume_per_op = batch_tokens × hidden_dim × bytes_per_element

For batch_tokens=256 (decode) with hidden_dim=8192 and FP16 (2 bytes per element):

volume_per_op = 256 × 8192 × 2 bytes ≈ 4.2 MB

Total all-reduce message volume per forward pass: 160 × 4.2 MB ≈ 670 MB (the ring protocol moves somewhat more than this per GPU, as shown next).
Ring All-Reduce Bandwidth
NCCL’s ring all-reduce for N GPUs transfers 2(N−1)/N × M bytes per GPU, where M is the message size. For TP=4 and M ≈ 4.2 MB, that is 1.5 × 4.2 MB ≈ 6.3 MB per GPU per all-reduce.

Over NVLink 4.0 at 450 GB/s bidirectional, this takes roughly 6.3 MB / 450 GB/s ≈ 14 µs per operation.

Total all-reduce time: 160 × 14 µs ≈ 2.2 ms per forward pass.
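The same arithmetic as a short snippet, using the message size and link bandwidth quoted above:

```python
batch_tokens, hidden_dim, bytes_per_elem = 256, 8192, 2  # decode batch, FP16
tp, num_all_reduces = 4, 160                             # 2 per layer x 80 layers

msg_bytes = batch_tokens * hidden_dim * bytes_per_elem   # ~4.2 MB per all-reduce
ring_bytes_per_gpu = 2 * (tp - 1) / tp * msg_bytes       # ~6.3 MB moved per GPU
time_per_op_s = ring_bytes_per_gpu / 450e9               # ~14 us at 450 GB/s
print(f"per-op message: {msg_bytes / 1e6:.1f} MB, "
      f"per-op time: {time_per_op_s * 1e6:.0f} us, "
      f"total: {num_all_reduces * time_per_op_s * 1e3:.1f} ms")
```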
But this is the theoretical minimum, assuming perfect bandwidth utilization. In practice, NCCL adds launch overhead to every operation.
Bucket-Based Overlapping
vLLM v1 reduces NCCL overhead by overlapping all-reduce with computation using a technique called bucket-based all-reduce:
```python
import torch
import torch.distributed as dist


class BucketedAllReduce:
    """
    Overlap all-reduce with subsequent layer computation.

    Instead of:
        compute_layer_i -> all_reduce_i -> compute_layer_i+1 -> all_reduce_i+1
    We do:
        compute_layer_i -> launch_async_all_reduce_i -> compute_layer_i+1 (overlapped)
                        -> wait_all_reduce_i -> use_result_i
    """

    def __init__(self, tp_group, num_buckets=4):
        self.tp_group = tp_group
        self.num_buckets = num_buckets
        self.streams = [torch.cuda.Stream() for _ in range(num_buckets)]
        self.events = [torch.cuda.Event() for _ in range(num_buckets)]

    def all_reduce_async(self, tensor, bucket_id):
        """Launch all-reduce on a dedicated CUDA stream."""
        stream = self.streams[bucket_id % self.num_buckets]
        event = self.events[bucket_id % self.num_buckets]
        with torch.cuda.stream(stream):
            dist.all_reduce(tensor, group=self.tp_group)
            event.record(stream)
        return event  # Caller waits on this before using the result

    def wait(self, event):
        """Block the default stream until all-reduce completes."""
        torch.cuda.current_stream().wait_event(event)
```
The overlap works because the all-reduce for layer i does not depend on the compute for layer i+1 — they operate on different tensors. By launching the all-reduce on a separate CUDA stream, the GPU can begin computing layer i+1’s matmuls on the default stream while the NVLink fabric handles the all-reduce communication concurrently.
```python
class TensorParallelTransformerLayer:
    def forward(self, hidden_states, bucketed_allreduce):
        # Attention block
        attn_out = self.self_attn(hidden_states)  # Produces partial output

        # Launch async all-reduce for attention output
        ar_event_attn = bucketed_allreduce.all_reduce_async(
            attn_out, bucket_id=self.layer_idx * 2
        )
        # While all-reduce runs, compute attention residual (no dependency)
        # ... other independent work ...

        # Wait for all-reduce before using the result
        bucketed_allreduce.wait(ar_event_attn)
        hidden_states = hidden_states + attn_out  # Residual connection

        # FFN block
        ffn_out = self.ffn(hidden_states)  # Produces partial output

        # Launch async all-reduce for FFN output
        ar_event_ffn = bucketed_allreduce.all_reduce_async(
            ffn_out, bucket_id=self.layer_idx * 2 + 1
        )
        # While all-reduce runs on NVLink, begin next layer's attention QKV projection
        # ... overlap region ...
        bucketed_allreduce.wait(ar_event_ffn)
        hidden_states = hidden_states + ffn_out  # Residual connection

        return hidden_states
```
Measured Overlap Efficiency
The overlap is most effective during prefill (large batch, long compute time per layer) and less effective during decode (small batch, short compute time per layer):
NCCL All-Reduce Overlap Efficiency (Llama 70B, TP=4, H100 NVLink)
| Phase | Batch Tokens | Compute/Layer (ms) | AllReduce/Layer (ms) | Overlap % | Effective AR Cost (ms) |
|---|---|---|---|---|---|
| Prefill | 4,096 | 1.8 | 0.09 | 95% | 0.005 |
| Prefill | 512 | 0.25 | 0.02 | 85% | 0.003 |
| Decode | 256 | 0.12 | 0.013 | 60% | 0.005 |
| Decode | 64 | 0.04 | 0.008 | 30% | 0.006 |
| Decode | 16 | 0.015 | 0.006 | 10% | 0.005 |
At batch_tokens=4096 (prefill), 95% of the all-reduce time is hidden behind computation. At batch_tokens=16 (small decode), only 10% is hidden — the all-reduce finishes long after the compute, so the next layer is blocked waiting.
Custom All-Reduce Kernels
For small message sizes (under 1 MB), NCCL’s ring all-reduce has high launch overhead relative to the actual data transfer. vLLM v1 includes custom all-reduce kernels that bypass NCCL for these small messages:
```cuda
// Custom one-shot all-reduce for small messages
// Uses P2P NVLink reads instead of the NCCL ring protocol
// Effective for messages < 1 MB on NVLink-connected GPUs
template <typename T, int kMaxSize>
__global__ void custom_all_reduce_kernel(
    T** buffers,        // Array of pointers to each rank's buffer (P2P-mapped)
    T* output,          // Output buffer (same result on all ranks)
    int num_elements,
    int num_ranks,
    int my_rank
) {
  const int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= num_elements) return;

  // Each rank's input already sits in its own buffer, which every peer can
  // read directly over NVLink (P2P memory access).
  // Sum the contributions from all ranks.
  T sum = 0;
  for (int r = 0; r < num_ranks; r++) {
    sum += buffers[r][tid];  // P2P read from rank r's buffer
  }

  // Write the reduced result locally; every rank computes the same sum.
  output[tid] = sum;
}
```
This custom kernel reduces all-reduce latency for small messages from ~10us (NCCL) to ~3us, a 3.3x improvement. For decode with small batches, this matters because the all-reduce constitutes a larger fraction of total layer time.
All-Reduce Latency: NCCL vs Custom Kernel (TP=4, H100 NVLink), in microseconds

The custom kernel advantage disappears above ~2 MB because NCCL’s ring protocol becomes bandwidth-efficient at larger message sizes. vLLM v1 automatically selects the custom kernel for messages under 1 MB and falls back to NCCL for larger messages.
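The selection logic amounts to a size check. A sketch of what the dispatch could look like (the 1 MB threshold is the one stated above; the function and the custom_all_reduce callable are illustrative, not vLLM's actual API):

```python
import torch
import torch.distributed as dist

CUSTOM_ALL_REDUCE_MAX_BYTES = 1 * 1024 * 1024  # fall back to NCCL above ~1 MB

def all_reduce_dispatch(tensor: torch.Tensor, tp_group, custom_all_reduce=None):
    """Use the one-shot P2P kernel for small messages, the NCCL ring otherwise."""
    msg_bytes = tensor.numel() * tensor.element_size()
    if custom_all_reduce is not None and msg_bytes <= CUSTOM_ALL_REDUCE_MAX_BYTES:
        return custom_all_reduce(tensor)     # low-latency path for small decode batches
    dist.all_reduce(tensor, group=tp_group)  # bandwidth-efficient ring path
    return tensor
```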
End-to-End Performance Impact
Scheduling Overlap
The biggest win from symmetric workers is eliminating the scheduling-compute serialization:
```
# v0 timeline (TP=4):
# t=0:      Worker 0 schedules        | Workers 1-3 IDLE
# t=1.5ms:  Worker 0 broadcasts       | Workers 1-3 waiting for broadcast
# t=2.3ms:  All workers begin forward |
# t=32ms:   Forward pass complete     |
# t=32ms:   Worker 0 samples          | Workers 1-3 IDLE again
# Total iteration: ~33.5ms

# v1 timeline (TP=4):
# t=0:      Scheduler process already prepared update (during previous forward pass)
# t=0:      All workers receive update via shared memory (< 0.01ms)
# t=0.01ms: All workers build inputs from cache (< 0.1ms)
# t=0.1ms:  All workers begin forward
# t=30.1ms: Forward pass complete
# t=30.1ms: All workers sample (in parallel)
# Total iteration: ~30.5ms
# Meanwhile, scheduler is already working on iteration t+1
```
End-to-End Iteration Time: v0 vs v1 (Llama 70B, TP=4, H100)
| Batch Size | v0 Iteration (ms) | v1 Iteration (ms) | Speedup | Throughput Gain |
|---|---|---|---|---|
| 32 | 29.3 | 28.0 | 1.05x | +4.6% |
| 128 | 33.2 | 30.5 | 1.09x | +8.8% |
| 256 | 36.0 | 31.8 | 1.13x | +13.2% |
| 512 | 40.3 | 33.5 | 1.20x | +20.3% |
At batch=512, v1 is 20% faster per iteration. This directly translates to 20% higher decode throughput (more tokens per second).
Communication Savings
The incremental update protocol eliminates the per-iteration broadcast entirely. The communication that remains is the all-reduce within the forward pass, which is required by tensor parallelism regardless of the worker architecture:
Communication Volume Per Iteration: v0 vs v1
| Component | v0 (batch=256) | v1 (batch=256) | Reduction |
|---|---|---|---|
| Scheduler -> Worker broadcast | 12.8 MB | 4.2 KB | 3047x |
| All-reduce (attention, 80 layers) | 640 MB | 640 MB | 1x (unchanged) |
| All-reduce (FFN, 80 layers) | 640 MB | 640 MB | 1x (unchanged) |
| Sampled tokens (Worker 0 -> others) | 1 KB | 0 KB (all sample) | eliminated |
| Total | 1,292.8 MB | 1,280.004 MB | 1.01x overall |
The incremental update reduces bytes by 3047x, but total communication only drops by 1% because all-reduce dominates. The real win is latency: eliminating the sequential scheduling phase saves 2-4ms per iteration at high batch sizes. Over 1000 decode steps, that is 2-4 seconds of wall-clock time savings — directly improving time-to-last-token for every request in the batch.
Implementation Details
Shared Memory Channel
The scheduler process and worker processes communicate via a POSIX shared memory region:
```python
import struct
import time
from multiprocessing import shared_memory


class SharedMemoryChannel:
    """Lock-free single-producer, multi-consumer channel via shared memory."""

    def __init__(self, name, size):
        self.shm = shared_memory.SharedMemory(name=name, create=True, size=size)
        self.buf = self.shm.buf
        # Header: [sequence_number: int64, data_size: int64]
        self.header_size = 16
        self.data_offset = self.header_size
        # Sequence numbers for detecting new updates
        self._next_seq = 0
        self._last_seen_seq = 0

    def _get_next_seq(self) -> int:
        self._next_seq += 1
        return self._next_seq

    def send(self, data: bytes):
        """Producer: write update to shared memory."""
        seq = self._get_next_seq()
        data_len = len(data)
        # Write the data first, then the header; readers only trust the data
        # once the sequence number has advanced
        self.buf[self.data_offset:self.data_offset + data_len] = data
        struct.pack_into("qq", self.buf, 0, seq, data_len)

    def recv(self) -> bytes:
        """Consumer: read update from shared memory (spin-wait)."""
        while True:
            seq, data_len = struct.unpack_from("qq", self.buf, 0)
            if seq > self._last_seen_seq:
                self._last_seen_seq = seq
                return bytes(self.buf[self.data_offset:self.data_offset + data_len])
            time.sleep(0)  # yield to the OS scheduler between polls
```
Shared memory avoids the kernel-space overhead of sockets or pipes. The update is a single memcpy of ~4-10 KB, completing in under 1 microsecond.
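Usage is a plain producer/consumer pair. A toy example against the class above (the pickled-dict payload is illustrative; in vLLM the producer is the scheduler process and the consumers are the worker processes):

```python
import pickle

channel = SharedMemoryChannel(name="sched_updates_demo", size=1024 * 1024)

# Scheduler side: serialize and publish a diff.
channel.send(pickle.dumps({"token_updates": {0: 42}, "finished_requests": []}))

# Worker side: poll for the next diff (here in the same process for brevity).
update = pickle.loads(channel.recv())
print(update["token_updates"])  # {0: 42}
```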
Worker Synchronization
Workers must stay synchronized: all workers must process the same update before beginning the forward pass. v1 uses NCCL barrier synchronization after applying the update:
```python
class V1Worker:
    def step(self):
        # Receive and apply update
        update = self.update_channel.recv()
        self._apply_update(update)
        input_tensors = self._build_inputs_from_cache()

        # Barrier: ensure every worker has applied the update before any
        # worker enters the forward pass (NCCL implements this as a tiny
        # all-reduce that acts as a barrier)
        dist.barrier(group=self.tp_group)

        # Forward pass (all workers execute in lock-step from here)
        hidden_states = self.model_shard.forward(input_tensors)
        dist.all_reduce(hidden_states, group=self.tp_group)
        next_tokens = self.sampler.sample(hidden_states)
        self._update_tokens(next_tokens)
        return next_tokens
```
The barrier adds ~2-3us of latency but guarantees that no worker enters the forward pass with stale state.
Summary
vLLM v1’s symmetric worker architecture eliminates the scheduling bottleneck that cost 3-10% of iteration time in v0. The key changes: the scheduler runs in a separate process (overlap with GPU computation), all workers are identical (no special driver worker), workers cache request state locally (receive only diffs), and all workers sample independently (no token broadcast).
The incremental update protocol reduces per-iteration scheduler-to-worker communication by 3000x, but the real benefit is latency rather than bandwidth: decoupling scheduling from the GPU critical path. NCCL all-reduce optimization through bucketed overlapping and custom small-message kernels further reduces communication overhead. Together, these changes yield a 5-20% end-to-end throughput improvement that scales with batch size.