A user sends a prompt, the model starts generating a response, and after 200 tokens the user navigates away from the page. Without cancellation, vLLM continues generating the remaining 1,800 tokens of the 2,048-token response, consuming GPU compute, KV cache blocks, and a batch slot for a response nobody will read. Multiply this by thousands of concurrent requests and the waste becomes significant: 10-30% of GPU cycles and KV cache capacity can be spent on cancelled requests.
vLLM v1 implements request cancellation through the async engine interface. When the SSE (Server-Sent Events) connection closes, the API server detects the disconnect and sends a cancellation signal to the engine. The engine marks the request for removal from the next batch, frees its KV cache blocks, and releases the batch slot. Early stopping extends this concept: instead of waiting for the user to cancel, the engine can stop generation when a confidence criterion is met (e.g., the model has already generated the answer and is now producing padding or repetition).
This post covers the complete cancellation and early stopping pipeline: detection, signaling, resource cleanup, partial result handling, and implementation.
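Before diving into each stage, the control flow can be sketched as a tiny asyncio program: a watcher polls the connection (standing in for a framework hook such as Starlette's `Request.is_disconnected()`) and flips a flag that the engine loop reads between decode steps. All names here are illustrative, not vLLM's actual API.

```python
import asyncio

async def watch_disconnect(is_disconnected, cancel_flag, poll_interval=0.001):
    """Poll the connection; flip the cancel flag once the client is gone.

    `is_disconnected` stands in for the web framework's connection check;
    the engine loop reads `cancel_flag` between decode steps.
    """
    while not cancel_flag["cancelled"]:
        if await is_disconnected():
            cancel_flag["cancelled"] = True
            cancel_flag["reason"] = "user_disconnect"
            return
        await asyncio.sleep(poll_interval)

async def demo():
    polls = {"n": 0}

    async def is_disconnected():
        # Simulated client: present for the first 3 polls, then gone.
        polls["n"] += 1
        return polls["n"] > 3

    flag = {"cancelled": False, "reason": None}
    await watch_disconnect(is_disconnected, flag)
    return flag

flag = asyncio.run(demo())
# flag == {"cancelled": True, "reason": "user_disconnect"}
```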
The Cost of Uncancelled Requests
Quantifying Waste
from dataclasses import dataclass
from enum import Enum
import time
import threading
class CancellationReason(Enum):
USER_DISCONNECT = "user_disconnect"
CLIENT_TIMEOUT = "client_timeout"
EARLY_STOP = "early_stop"
PREEMPTION = "preemption"
SERVER_SHUTDOWN = "server_shutdown"
RATE_LIMIT = "rate_limit"
@dataclass
class RequestWasteEstimate:
total_requests: int
cancelled_requests: int
avg_tokens_wasted_per_cancel: int
gpu_seconds_wasted: float
kv_blocks_held_unnecessarily: int
batch_slots_wasted_seconds: float
class WasteEstimator:
"""
Estimate the resource waste from uncancelled requests.
"""
def estimate(self, request_stats):
"""
Estimate waste from a batch of request statistics.
request_stats: list of {
"total_tokens": int,
"tokens_at_cancel": int,
"was_cancelled": bool,
"cancel_detected_at": int (token position when cancel detected),
"request_duration_ms": float,
"kv_blocks_used": int,
}
"""
total = len(request_stats)
cancelled = [r for r in request_stats if r["was_cancelled"]]
n_cancelled = len(cancelled)
if n_cancelled == 0:
return RequestWasteEstimate(
total_requests=total,
cancelled_requests=0,
avg_tokens_wasted_per_cancel=0,
gpu_seconds_wasted=0,
kv_blocks_held_unnecessarily=0,
batch_slots_wasted_seconds=0,
)
# Tokens wasted = tokens generated after user disconnected
tokens_wasted = [
r["total_tokens"] - r["cancel_detected_at"]
for r in cancelled
]
# GPU time wasted (approximate: 20ms per token on H100)
ms_per_token = 20
gpu_ms_wasted = sum(t * ms_per_token for t in tokens_wasted)
# KV blocks held after cancel
blocks_wasted = sum(r["kv_blocks_used"] for r in cancelled)
# Batch slot time
slot_ms_wasted = sum(
r["request_duration_ms"] * (
1 - r["cancel_detected_at"] / max(r["total_tokens"], 1)
)
for r in cancelled
)
return RequestWasteEstimate(
total_requests=total,
cancelled_requests=n_cancelled,
avg_tokens_wasted_per_cancel=int(
sum(tokens_wasted) / max(n_cancelled, 1)
),
gpu_seconds_wasted=gpu_ms_wasted / 1000,
kv_blocks_held_unnecessarily=blocks_wasted,
batch_slots_wasted_seconds=slot_ms_wasted / 1000,
)
Resource Waste from Uncancelled Requests (10K RPS, Llama 70B)
| Metric | Without Cancellation | With Cancellation | Savings |
|---|---|---|---|
| Cancelled request rate | 15% of requests | 15% of requests | - |
| Avg tokens wasted per cancel | 800 | 5 (detection delay) | 99.4% |
| GPU seconds wasted per hour (per GPU) | 720s (20% of 1 GPU) | 4.5s | 99.4% |
| KV cache blocks held unnecessarily | ~3,000 (15% of capacity) | ~20 | 99.3% |
| Effective throughput loss | 12-18% | under 1% | ~15% |
| Batch slot utilization | 85% | 98% | +13% |
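The per-cancel rows in the table follow directly from its stated assumptions (20 ms of GPU time per generated token, 800 tokens wasted per cancel without cancellation versus ~5 with it); a quick sanity check of the arithmetic:

```python
# Reproduce the table's per-cancel figures from its stated assumptions.
ms_per_token = 20
tokens_wasted_without = 800  # generated after disconnect, no cancellation
tokens_wasted_with = 5       # detection delay of a few tokens

gpu_seconds_per_cancel = tokens_wasted_without * ms_per_token / 1000
savings = 1 - tokens_wasted_with / tokens_wasted_without

# gpu_seconds_per_cancel == 16.0
# savings == 0.99375, i.e. the table's 99.4%
```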
Cancellation Detection
SSE Disconnect Detection
In streaming mode, vLLM sends tokens over a Server-Sent Events (SSE) connection. When the client disconnects (user navigates away, closes tab, or the connection drops), the server detects a broken pipe on the next write attempt.
import asyncio
from typing import Optional, AsyncIterator
class CancellationToken:
"""
A token that can be checked to see if a request
has been cancelled.
Thread-safe: can be set from the API server thread
and checked from the engine thread.
"""
def __init__(self, request_id):
self.request_id = request_id
self._cancelled = False
self._cancel_reason = None
self._cancel_time = None
self._lock = threading.Lock()
self._event = asyncio.Event()
def cancel(self, reason=CancellationReason.USER_DISCONNECT):
"""Cancel the request."""
with self._lock:
if not self._cancelled:
self._cancelled = True
self._cancel_reason = reason
self._cancel_time = time.time()
# Signal async waiters
self._event.set()
@property
def is_cancelled(self):
"""Check if the request has been cancelled."""
return self._cancelled
@property
def cancel_reason(self):
"""Get the cancellation reason."""
return self._cancel_reason
async def wait_for_cancel(self):
"""Async wait for cancellation."""
await self._event.wait()
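The same contract can be condensed onto `threading.Event`, which provides the set-once, check-from-any-thread semantics for free (a sketch; the full class above also records the cancel timestamp and bridges to asyncio):

```python
import threading

class MiniCancellationToken:
    """Set once from any thread, checked cheaply from the engine thread."""

    def __init__(self):
        self._event = threading.Event()
        self._lock = threading.Lock()
        self.reason = None

    def cancel(self, reason="user_disconnect"):
        with self._lock:
            if not self._event.is_set():  # first cancel wins
                self.reason = reason
                self._event.set()

    @property
    def is_cancelled(self):
        return self._event.is_set()

token = MiniCancellationToken()
t = threading.Thread(target=token.cancel, args=("client_timeout",))
t.start()
t.join()
token.cancel("early_stop")  # ignored: already cancelled
# token.is_cancelled is True; token.reason == "client_timeout"
```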
class SSECancellationDetector:
"""
Detect client disconnection on SSE streams.
Detection methods:
1. Write failure: next SSE write raises ConnectionResetError
2. Periodic heartbeat: send empty comment every 5s, detect failure
3. Client-side keepalive: client sends periodic pings
"""
def __init__(self, heartbeat_interval=5.0):
self.heartbeat_interval = heartbeat_interval
self.active_connections = {}
async def stream_with_cancellation(
self, request_id, token_iterator,
cancellation_token
):
"""
Stream tokens to client with cancellation detection.
Yields tokens until either:
1. Generation completes normally
2. Client disconnects (detected via write failure)
3. Cancellation token is set externally
"""
try:
async for token_chunk in token_iterator:
# Check cancellation before each write
if cancellation_token.is_cancelled:
break
# Attempt to write (will raise on disconnect)
yield self._format_sse(token_chunk)
except (ConnectionResetError, BrokenPipeError,
asyncio.CancelledError):
# Client disconnected
cancellation_token.cancel(
CancellationReason.USER_DISCONNECT
)
finally:
# Clean up
if request_id in self.active_connections:
del self.active_connections[request_id]
def _format_sse(self, data):
"""Format data as an SSE event."""
import json
return f"data: {json.dumps(data)}\n\n"
async def heartbeat_loop(self, request_id, write_fn,
cancellation_token):
"""
Send periodic heartbeats to detect silent disconnects.
Some proxies/load balancers drop idle connections
without sending a reset.
"""
while not cancellation_token.is_cancelled:
try:
await asyncio.sleep(self.heartbeat_interval)
# SSE comment (ignored by client, detects broken pipe)
await write_fn(": heartbeat\n\n")
except (ConnectionResetError, BrokenPipeError):
cancellation_token.cancel(
CancellationReason.USER_DISCONNECT
)
break
Engine-Side Cancellation
Removing Requests from the Batch
When the engine processes a cancellation, it must remove the request from the current batch and free all associated resources. This happens between decode steps — the engine checks for cancelled requests before each iteration.
class RequestCancellationManager:
"""
Manage request cancellations in the engine.
Cancellation happens at the boundary between decode steps:
1. Before each step, check for cancelled requests
2. Remove cancelled requests from the running batch
3. Free KV cache blocks
4. Free batch slot
5. Send partial result to the API layer
Cancellation is NOT instantaneous: the request completes
the current decode step before being removed. This adds
at most one TBT period (~20-30ms) of latency.
"""
def __init__(self):
self.cancellation_tokens = {} # request_id -> CancellationToken
self.pending_cancellations = []
self.cancellation_stats = {
"total_cancelled": 0,
"total_tokens_saved": 0,
"total_blocks_freed": 0,
"by_reason": {},
}
def register_request(self, request_id, cancellation_token):
"""Register a request's cancellation token."""
self.cancellation_tokens[request_id] = cancellation_token
def check_cancellations(self, running_requests):
"""
Check all running requests for cancellation.
Called once per decode step.
Returns list of request_ids to remove.
"""
to_cancel = []
for request_id in running_requests:
token = self.cancellation_tokens.get(request_id)
if token and token.is_cancelled:
to_cancel.append(request_id)
return to_cancel
def execute_cancellation(self, request_id, request_state,
kv_cache_manager):
"""
Execute cancellation: free resources and record stats.
request_state: {
"tokens_generated": int,
"max_tokens": int,
"kv_block_ids": list,
"started_at": float,
}
"""
# Free KV cache blocks
block_ids = request_state.get("kv_block_ids", [])
for block_id in block_ids:
kv_cache_manager.free_block(block_id)
# Compute savings
tokens_generated = request_state.get("tokens_generated", 0)
max_tokens = request_state.get("max_tokens", 0)
tokens_saved = max(0, max_tokens - tokens_generated)
# Record stats
cancel_token = self.cancellation_tokens.get(request_id)
reason = (cancel_token.cancel_reason.value
if cancel_token and cancel_token.cancel_reason
else "unknown")
self.cancellation_stats["total_cancelled"] += 1
self.cancellation_stats["total_tokens_saved"] += tokens_saved
self.cancellation_stats["total_blocks_freed"] += len(block_ids)
self.cancellation_stats["by_reason"][reason] = (
self.cancellation_stats["by_reason"].get(reason, 0) + 1
)
# Clean up
if request_id in self.cancellation_tokens:
del self.cancellation_tokens[request_id]
return {
"request_id": request_id,
"tokens_generated": tokens_generated,
"tokens_saved": tokens_saved,
"blocks_freed": len(block_ids),
"reason": reason,
}
class BatchManager:
"""
Manages the running batch with cancellation support.
"""
def __init__(self, kv_cache_manager):
self.running_batch = {} # request_id -> request_state
self.kv_cache = kv_cache_manager
self.cancel_manager = RequestCancellationManager()
def add_request(self, request_id, request_state,
cancellation_token):
"""Add a request to the running batch."""
self.running_batch[request_id] = request_state
self.cancel_manager.register_request(
request_id, cancellation_token
)
def pre_step_cleanup(self):
"""
Called before each decode step.
Removes cancelled requests from the batch.
Returns list of cancellation results.
"""
to_cancel = self.cancel_manager.check_cancellations(
list(self.running_batch.keys())
)
results = []
for request_id in to_cancel:
state = self.running_batch.pop(request_id)
result = self.cancel_manager.execute_cancellation(
request_id, state, self.kv_cache
)
results.append(result)
return results
def get_active_batch_size(self):
"""Get current number of active (non-cancelled) requests."""
return len(self.running_batch)
Cancellation granularity is one decode step: a request cancelled mid-step completes the current step before being removed. At a typical TBT of 20-30ms, this means at most one extra token is generated after cancellation, a small price for the resource savings.
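A toy version of that step-boundary behavior (simplified, not vLLM's actual scheduler loop): requests flagged before a step never decode in it, so a flag set mid-step costs at most one token.

```python
def run_decode_loop(batch, cancelled, max_steps=100):
    """batch maps request_id -> tokens generated so far;
    `cancelled` is the set of request_ids whose flag is set."""
    for _ in range(max_steps):
        # Step boundary: drop cancelled requests before decoding.
        for rid in list(batch):
            if rid in cancelled:
                del batch[rid]
        if not batch:
            break
        # Decode step: every surviving request emits one token.
        for rid in batch:
            batch[rid] += 1
    return batch

# "b" is flagged before the loop starts, so it never generates a token;
# "a" runs to the step budget.
result = run_decode_loop({"a": 0, "b": 0}, cancelled={"b"}, max_steps=5)
# result == {"a": 5}
```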
KV Cache Block Reclamation
Immediate vs Deferred Freeing
When a request is cancelled, its KV cache blocks can be freed immediately (returned to the free pool) or deferred (marked for freeing but retained briefly in case of prefix cache hits).
class KVCacheReclaimer:
"""
Reclaim KV cache blocks from cancelled requests.
Two strategies:
1. Immediate: free blocks instantly (maximizes free space)
2. Deferred with prefix retention: keep prefix blocks
in cache for potential reuse by future requests
with similar prefixes
"""
def __init__(self, block_manager, prefix_cache_enabled=True):
self.block_manager = block_manager
self.prefix_cache_enabled = prefix_cache_enabled
self.reclaim_stats = {
"immediate_reclaims": 0,
"deferred_reclaims": 0,
"prefix_cache_hits": 0,
}
def reclaim(self, request_id, block_ids, token_count,
prompt_length):
"""
Reclaim blocks from a cancelled request.
If prefix caching is enabled, keep the prompt blocks
(they may be reused by other requests with the same
system prompt or prefix).
Free the generation blocks (unique to this request,
unlikely to be reused).
"""
if not self.prefix_cache_enabled:
# Free everything immediately
for block_id in block_ids:
self.block_manager.free_block(block_id)
self.reclaim_stats["immediate_reclaims"] += len(block_ids)
return {
"freed": len(block_ids),
"retained": 0,
}
# With prefix caching: separate prompt and generation blocks
tokens_per_block = self.block_manager.block_size
prompt_blocks = (prompt_length + tokens_per_block - 1) // tokens_per_block
gen_blocks = len(block_ids) - prompt_blocks
# Keep prompt blocks in prefix cache
prompt_block_ids = block_ids[:prompt_blocks]
gen_block_ids = block_ids[prompt_blocks:]
# Free generation blocks immediately
for block_id in gen_block_ids:
self.block_manager.free_block(block_id)
# Move prompt blocks to prefix cache (evictable, not locked)
for block_id in prompt_block_ids:
self.block_manager.mark_as_cached(block_id)
self.reclaim_stats["immediate_reclaims"] += len(gen_block_ids)
self.reclaim_stats["deferred_reclaims"] += len(prompt_block_ids)
return {
"freed": len(gen_block_ids),
"retained_in_cache": len(prompt_block_ids),
}
def force_reclaim_all(self, request_id, block_ids):
"""
Force-free all blocks (including prefix).
Used when KV cache is under pressure.
"""
for block_id in block_ids:
self.block_manager.free_block(block_id)
self.reclaim_stats["immediate_reclaims"] += len(block_ids)
return {"freed": len(block_ids), "retained": 0}
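The prompt/generation split above is plain ceiling division on the block size. A standalone sketch of that arithmetic (block_size=16 matches vLLM's default tokens-per-block; the block IDs are illustrative):

```python
def split_blocks(block_ids, prompt_length, block_size=16):
    """Return (prompt_block_ids, generation_block_ids) for a request."""
    n_prompt_blocks = (prompt_length + block_size - 1) // block_size  # ceil
    return block_ids[:n_prompt_blocks], block_ids[n_prompt_blocks:]

# A 45-token prompt spans ceil(45 / 16) = 3 blocks; the remaining
# blocks hold generated tokens and are freed immediately on cancel.
prompt_blocks, gen_blocks = split_blocks(list(range(8)), prompt_length=45)
# prompt_blocks == [0, 1, 2], gen_blocks == [3, 4, 5, 6, 7]
```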
Early Stopping
Stopping Before Max Tokens
Early stopping terminates generation before reaching max_tokens when the model signals that the response is complete or when continued generation would be wasteful.
class EarlyStopDetector:
"""
Detect when generation should stop early.
Conditions:
1. EOS token generated (standard stop)
2. Stop sequence matched (user-defined stop strings)
3. Repetition detected (model stuck in a loop)
4. Confidence drop (model becomes uncertain)
5. Quality degradation (for batch processing)
"""
def __init__(self, stop_sequences=None, max_repetition=3,
min_confidence=0.1):
self.stop_sequences = stop_sequences or []
self.max_repetition = max_repetition
self.min_confidence = min_confidence
def should_stop(self, generated_tokens, token_probs,
eos_token_id):
"""
Check if generation should stop early.
Returns (should_stop, reason).
"""
# Check 1: EOS token
if generated_tokens and generated_tokens[-1] == eos_token_id:
return True, "eos_token"
# Check 2: Stop sequences
if self.stop_sequences:
stop_result = self._check_stop_sequences(
generated_tokens
)
if stop_result:
return True, f"stop_sequence: {stop_result}"
# Check 3: Repetition detection
if self._detect_repetition(generated_tokens):
return True, "repetition_detected"
# Check 4: Confidence drop
if token_probs and self._confidence_too_low(token_probs):
return True, "low_confidence"
return False, None
def _check_stop_sequences(self, tokens):
"""Check if any stop sequence has been generated."""
# In production: check decoded text against stop strings
# Simplified: check token patterns
for seq in self.stop_sequences:
if isinstance(seq, list):
# Token-level stop sequence
if (len(tokens) >= len(seq)
and tokens[-len(seq):] == seq):
return seq
return None
def _detect_repetition(self, tokens):
"""
Detect if the model is stuck in a repetition loop.
Check if the last N tokens repeat a pattern of length K
for max_repetition times.
"""
if len(tokens) < 20:
return False
# Check patterns of length 3 to 20
recent = tokens[-100:] # Check last 100 tokens
for pattern_len in range(3, min(21, len(recent) // 3)):
pattern = recent[-pattern_len:]
repeat_count = 0
pos = len(recent) - pattern_len
while pos >= pattern_len:
segment = recent[pos - pattern_len:pos]
if segment == pattern:
repeat_count += 1
pos -= pattern_len
else:
break
if repeat_count >= self.max_repetition:
return True
return False
def _confidence_too_low(self, token_probs):
"""
Check if the model's confidence has dropped too low.
Low confidence (high entropy) over multiple tokens
suggests the model is generating noise.
"""
if len(token_probs) < 5:
return False
# Check last 5 tokens
recent_probs = token_probs[-5:]
avg_prob = sum(recent_probs) / len(recent_probs)
return avg_prob < self.min_confidence
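The repetition check is the subtlest of the stop conditions, so here it is condensed into a standalone function (same pattern-scan idea as `_detect_repetition` above, with the tail pattern itself counted in `repeats`):

```python
def detect_repetition(tokens, max_repetition=3, window=100):
    """True if the tail of `tokens` repeats a short pattern
    back-to-back at least `max_repetition` times."""
    if len(tokens) < 20:
        return False
    recent = tokens[-window:]
    for pattern_len in range(3, min(21, len(recent) // 3)):
        pattern = recent[-pattern_len:]
        repeats = 1  # the tail pattern itself
        pos = len(recent) - pattern_len
        # Walk backwards, counting consecutive copies of the pattern.
        while pos >= pattern_len and recent[pos - pattern_len:pos] == pattern:
            repeats += 1
            pos -= pattern_len
        if repeats >= max_repetition:
            return True
    return False

looping = [7, 8, 9] * 10   # stuck repeating a 3-token pattern
healthy = list(range(30))  # no repeated pattern
# detect_repetition(looping) is True; detect_repetition(healthy) is False
```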
class BatchEarlyStopper:
"""
Early stopping for batch processing scenarios.
In batch processing (e.g., generating summaries for
1000 documents), early stopping saves significant
compute by stopping each request as soon as it produces
a complete answer.
Unlike interactive use where the user decides when to stop,
batch processing uses automated quality criteria.
"""
def __init__(self, task_type="summarization"):
self.task_type = task_type
self.criteria = self._get_criteria(task_type)
def _get_criteria(self, task_type):
"""Get stopping criteria for a task type."""
return {
"summarization": {
"max_length_factor": 0.3, # Summary < 30% of input
"stop_on_period_after": 50, # Stop after 50+ tokens
# at sentence boundary
"min_tokens": 20,
},
"classification": {
"max_tokens": 10, # Classification should be short
"stop_on_label": True,
},
"extraction": {
"stop_on_json_close": True, # Stop when JSON is complete
"max_tokens": 500,
},
}.get(task_type, {"max_tokens": 2048})
def should_stop(self, generated_text, generated_tokens,
input_length):
"""
Check if batch generation should stop for this request.
"""
criteria = self.criteria
n_tokens = len(generated_tokens)
# Minimum tokens check
min_tokens = criteria.get("min_tokens", 1)
if n_tokens < min_tokens:
return False, None
# Max length factor (for summarization)
if "max_length_factor" in criteria:
max_tokens = int(input_length * criteria["max_length_factor"])
if n_tokens >= max_tokens:
return True, "max_length_factor_reached"
# Stop at sentence boundary
if "stop_on_period_after" in criteria:
threshold = criteria["stop_on_period_after"]
if (n_tokens >= threshold
and generated_text.rstrip().endswith(".")):
return True, "sentence_boundary"
# JSON completion
if criteria.get("stop_on_json_close"):
if self._is_json_complete(generated_text):
return True, "json_complete"
return False, None
def _is_json_complete(self, text):
"""Check if generated text is complete JSON."""
text = text.strip()
if not text:
return False
# Count braces/brackets
open_braces = text.count("{") - text.count("}")
open_brackets = text.count("[") - text.count("]")
# Complete if all opened are closed and text starts with { or [
return (
(text.startswith("{") or text.startswith("["))
and open_braces == 0
and open_brackets == 0
)
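One caveat with `_is_json_complete`: brace counting is fooled by braces inside string literals. A stricter check that is still cheap is to simply attempt a parse; a standalone sketch:

```python
import json

def is_json_complete(text):
    """Strict completeness check: the text parses as a JSON value.
    Unlike brace counting, this is not fooled by braces in strings."""
    text = text.strip()
    if not text or text[0] not in "{[":
        return False
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# is_json_complete('{"a": [1, 2]}') is True
# is_json_complete('{"a": [1, ')   is False
# Brace counting would call this one incomplete; parsing gets it right:
# is_json_complete('{"note": "open { brace"}') is True
```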
Early Stopping Savings by Task Type
| Metric | Chat (user cancels) | Summarization | Classification | JSON extraction | Code generation |
|---|---|---|---|---|---|
| Average tokens saved vs max_tokens | - | - | - | - | - |
Partial Result Handling
Returning What Was Generated
When a request is cancelled, the partially generated response should still be returned to the client (for logging, debugging, or graceful degradation).
@dataclass
class PartialResult:
request_id: str
prompt: str
generated_text: str
generated_tokens: int
max_tokens: int
finish_reason: str # "cancelled", "early_stop", "length", "stop"
cancel_reason: str
cancel_token_position: int
partial: bool
class PartialResultHandler:
"""
Handle partial results from cancelled or early-stopped requests.
"""
def __init__(self):
self.partial_results = {}
def create_partial_result(self, request_id, request_state,
cancellation_info):
"""
Create a partial result from a cancelled request.
"""
result = PartialResult(
request_id=request_id,
prompt=request_state.get("prompt", ""),
generated_text=request_state.get("generated_text", ""),
generated_tokens=request_state.get("tokens_generated", 0),
max_tokens=request_state.get("max_tokens", 0),
finish_reason="cancelled",
cancel_reason=cancellation_info.get("reason", "unknown"),
cancel_token_position=request_state.get(
"tokens_generated", 0
),
partial=True,
)
self.partial_results[request_id] = result
return result
def format_api_response(self, result):
"""
Format partial result as an API response.
Compatible with OpenAI API format.
"""
return {
"id": result.request_id,
"object": "chat.completion",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": result.generated_text,
},
"finish_reason": result.finish_reason,
}],
"usage": {
"prompt_tokens": len(result.prompt.split()),  # rough: whitespace split, not a tokenizer count
"completion_tokens": result.generated_tokens,
"total_tokens": (
len(result.prompt.split())
+ result.generated_tokens
),
},
"partial": result.partial,
"cancellation_info": {
"reason": result.cancel_reason,
"tokens_generated": result.generated_tokens,
"tokens_remaining": (
result.max_tokens - result.generated_tokens
),
},
}
Complete Cancellation Pipeline
End-to-End Integration
class CancellationPipeline:
"""
Complete cancellation pipeline integrating:
- SSE disconnect detection
- Engine-side request removal
- KV cache reclamation
- Partial result handling
- Metrics collection
"""
def __init__(self, batch_manager, kv_reclaimer,
partial_handler):
self.batch_manager = batch_manager
self.kv_reclaimer = kv_reclaimer
self.partial_handler = partial_handler
self.metrics = {
"cancellations_total": 0,
"tokens_saved_total": 0,
"blocks_freed_total": 0,
"early_stops_total": 0,
"avg_cancel_latency_ms": 0.0,
}
self._cancel_latencies = []
def on_client_disconnect(self, request_id):
"""
Called when a client disconnects.
This is the entry point for user-initiated cancellation.
"""
cancel_start = time.time()
# Signal the cancellation token
token = self.batch_manager.cancel_manager.cancellation_tokens.get(
request_id
)
if token:
token.cancel(CancellationReason.USER_DISCONNECT)
# The actual removal happens in the next pre_step_cleanup
# Track latency
self._cancel_latencies.append(
(time.time() - cancel_start) * 1000
)
def on_early_stop(self, request_id, reason):
"""
Called when early stopping criteria are met.
"""
token = self.batch_manager.cancel_manager.cancellation_tokens.get(
request_id
)
if token:
token.cancel(CancellationReason.EARLY_STOP)
self.metrics["early_stops_total"] += 1
def process_cancellations(self):
"""
Process all pending cancellations.
Called once per decode step by the engine loop.
"""
results = self.batch_manager.pre_step_cleanup()
for result in results:
# Reclaim KV cache. Note: in this sketch the summary dict from
# execute_cancellation() stands in for the full request state, and
# execute_cancellation() has already freed the blocks; a real
# pipeline passes the request state through and frees blocks in
# exactly one place.
request_state = result  # Simplified
reclaim_result = self.kv_reclaimer.reclaim(
result["request_id"],
request_state.get("kv_block_ids", []),
result["tokens_generated"],
request_state.get("prompt_length", 0),
)
# Create partial result
partial = self.partial_handler.create_partial_result(
result["request_id"], request_state, result
)
# Update metrics
self.metrics["cancellations_total"] += 1
self.metrics["tokens_saved_total"] += result["tokens_saved"]
self.metrics["blocks_freed_total"] += result["blocks_freed"]
if self._cancel_latencies:
self.metrics["avg_cancel_latency_ms"] = (
sum(self._cancel_latencies)
/ len(self._cancel_latencies)
)
return results
def get_metrics(self):
"""Get cancellation pipeline metrics."""
return dict(self.metrics)
Cancellation Pipeline Timing
| Stage | Time | Blocking | Notes |
|---|---|---|---|
| Client disconnect detection | 0-5ms | No | Async, detected on next SSE write |
| Cancellation token set | under 0.1ms | No | Atomic flag set |
| Engine detects in pre_step | 0-30ms (up to one TBT period) | No | Checked each decode step |
| Batch removal | under 0.1ms | No | Remove from running set |
| KV cache block free | under 0.1ms | No | Return blocks to free list |
| Total cancel latency | 5-35ms | - | Dominated by step boundary wait |
Key Takeaways
Request cancellation and early stopping are essential for efficient LLM serving. Without them, 10-30% of GPU resources are wasted on generating tokens that nobody reads.
The key implementation points:
-
Detection latency is bounded by TBT: Cancellation is detected and acted on at the boundary between decode steps. The worst-case delay is one TBT period (20-30ms for most models). This is acceptable because the user has already disconnected — sub-second precision is sufficient.
-
KV cache reclamation is the primary benefit: Freeing blocks from cancelled requests immediately returns capacity to serve new requests. In a KV-cache-constrained system (which most production deployments are), this directly translates to higher throughput.
-
Prefix caching interacts with cancellation: When a request is cancelled, its prompt blocks may still be useful for future requests with the same prefix. Freeing generation blocks immediately while leaving prompt blocks in the (evictable) prefix cache gets the best of both worlds.
-
Early stopping provides even larger savings: User cancellation saves the tail of generation. Early stopping (repetition detection, confidence drop, task-specific criteria) saves the entire unnecessary generation. For batch processing tasks like classification, early stopping can save 90%+ of tokens.
-
Partial results must be propagated: Even cancelled requests should return what was generated. This enables graceful degradation (show partial response with “generation stopped”), logging (track cancellation patterns), and billing (charge for tokens actually generated, not requested).
The throughput impact: at a 15% cancellation rate with an average of 800 tokens saved per cancellation, a cluster serving 10K requests per second avoids generating 1.2 million tokens per second. Because decoding is batched, the GPU time saved per token is the marginal (amortized) cost rather than the full 20ms of per-request latency, so the exact figure depends on batch size; however it is counted, the reclaimed capacity amounts to multiple additional H100s' worth of throughput, recovered from software alone.