A 30-second video clip at 1 FPS generates 30 frames, each producing 300+ visual tokens — 9,000 tokens of visual context before the user has typed a single word. Audio at 16 kHz for the same 30 seconds yields 480,000 raw samples, compressed into hundreds of spectrogram tokens. These numbers dwarf typical text prompts by 10-30x, and they create batching challenges that do not exist in text-only serving: variable-length multi-modal contexts that can exceed 10,000 tokens, encoder forward passes that temporarily spike memory usage by 5-10 GB, and heterogeneous batch compositions where one request has a 500-token text prompt and another has a 9,000-token video prompt.
This post covers the temporal encoding pipelines for video and audio, the token interleaving strategy, batching under heterogeneous modality mixes, and the KV cache implications of large multi-modal contexts.
Video Processing Pipeline
Video input flows through three stages: frame sampling, per-frame encoding, and temporal aggregation.
Frame Sampling
Not every video frame carries unique information. vLLM samples frames at a configurable rate:
```python
import torch


class VideoProcessor:
    def __init__(self, config):
        self.max_frames = config.get("max_frames", 32)
        self.fps = config.get("fps", 1.0)  # Sample 1 frame per second
        self.min_frames = config.get("min_frames", 4)

    def sample_frames(self, video_path: str) -> list:
        """Sample frames from video at configured FPS."""
        import decord
        vr = decord.VideoReader(video_path)
        total_frames = len(vr)
        video_fps = vr.get_avg_fps()
        duration = total_frames / video_fps

        # Calculate number of frames to sample
        num_samples = int(duration * self.fps)
        num_samples = max(self.min_frames,
                          min(num_samples, self.max_frames))

        # Uniform sampling
        indices = torch.linspace(0, total_frames - 1, num_samples).long()
        frames = vr.get_batch(indices.tolist()).asnumpy()
        return frames  # Shape: [num_frames, H, W, 3]
```
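The index arithmetic behind the uniform sampling step can be checked without torch or decord. A minimal stdlib sketch of the same linspace-and-truncate logic (`uniform_frame_indices` is an illustrative helper, not vLLM API):

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list:
    """Evenly spaced frame indices from 0 to total_frames - 1,
    truncated to ints like torch.linspace(...).long()."""
    if num_samples <= 1:
        return [0]
    step = (total_frames - 1) / (num_samples - 1)
    return [int(i * step) for i in range(num_samples)]

# A 30 s clip at 30 FPS has 900 frames; sampling 30 of them
# always includes the first and last frame.
indices = uniform_frame_indices(900, 30)
```

Because the endpoints are pinned, the stride between samples grows with clip length while coverage of the full duration is preserved.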
Per-Frame Encoding
Each frame passes through the vision encoder (typically a ViT) to produce visual tokens:
```python
import torch


class VideoEncoder:
    def __init__(self, vision_encoder, projection):
        self.vision_encoder = vision_encoder  # e.g., a CLIP-style ViT
        self.projection = projection  # Linear layer to LLM hidden dim

    def encode_frames(self, frames: torch.Tensor) -> torch.Tensor:
        """Encode video frames into token embeddings.

        Args:
            frames: [num_frames, 3, H, W] preprocessed frames
        Returns:
            tokens: [num_frames, tokens_per_frame, hidden_dim]
        """
        # Encode all frames in one batched forward pass
        # ViT outputs: [num_frames, num_patches + 1, encoder_dim]
        # num_patches = (H // patch_size) * (W // patch_size)
        frame_features = self.vision_encoder(frames)

        # Remove CLS token, keep patch tokens
        # (encoders without a CLS token, e.g. SigLIP, skip this slice)
        patch_features = frame_features[:, 1:, :]
        # Shape: [num_frames, num_patches, encoder_dim]

        # Project to LLM hidden dimension
        projected = self.projection(patch_features)
        # Shape: [num_frames, num_patches, hidden_dim]
        return projected
```
For a ViT with patch size 14 and input resolution 224x224:
- Patches per frame: (224 / 14) x (224 / 14) = 16 x 16 = 256
- Tokens per frame: 256
- For 30 frames: 30 x 256 = 7,680 visual tokens
Visual Token Count by Video Configuration
| Duration (s) | FPS | Frames | Resolution | Patch Size | Tokens/Frame | Total Tokens |
|---|---|---|---|---|---|---|
| 10 | 1 | 10 | 224x224 | 14 | 256 | 2,560 |
| 30 | 1 | 30 | 224x224 | 14 | 256 | 7,680 |
| 60 | 1 | 32 (capped) | 224x224 | 14 | 256 | 8,192 |
| 30 | 2 | 32 (capped) | 336x336 | 14 | 576 | 18,432 |
| 30 | 1 | 30 | 448x448 | 14 | 1,024 | 30,720 |
At 448x448 resolution with 30 frames, the visual token count reaches 30,720 — exceeding the context length of many models. vLLM applies token pooling or dynamic resolution reduction to stay within the model’s maximum context window. Always set max_frames conservatively for production.
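The token counts in the table follow directly from the patch arithmetic and the frame cap. A quick sketch (function names are illustrative, not vLLM API):

```python
def tokens_per_frame(resolution: int, patch_size: int) -> int:
    # Square input: one token per patch, CLS token dropped
    return (resolution // patch_size) ** 2

def total_visual_tokens(duration_s: float, fps: float, resolution: int,
                        patch_size: int = 14, max_frames: int = 32,
                        min_frames: int = 4) -> int:
    frames = max(min_frames, min(int(duration_s * fps), max_frames))
    return frames * tokens_per_frame(resolution, patch_size)

# Matches the rows above:
# total_visual_tokens(30, 1, 224) -> 7,680
# total_visual_tokens(60, 1, 224) -> 8,192 (capped at 32 frames)
# total_visual_tokens(30, 1, 448) -> 30,720
```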
Temporal Aggregation
Raw per-frame tokens lack temporal information. Models use different strategies to inject temporal awareness:
Strategy 1: Temporal Position Embeddings
Add learned temporal embeddings to each frame’s tokens:
```python
import torch


class TemporalPositionEncoder:
    def __init__(self, max_frames: int, hidden_dim: int):
        self.temporal_embed = torch.nn.Embedding(max_frames, hidden_dim)

    def encode(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        """Add temporal position to frame tokens.

        Args:
            frame_tokens: [num_frames, tokens_per_frame, hidden_dim]
        Returns:
            [num_frames, tokens_per_frame, hidden_dim]
        """
        num_frames = frame_tokens.shape[0]
        temporal_ids = torch.arange(num_frames, device=frame_tokens.device)
        # Broadcast temporal embedding across all tokens in a frame
        t_embed = self.temporal_embed(temporal_ids)  # [num_frames, hidden_dim]
        t_embed = t_embed.unsqueeze(1)  # [num_frames, 1, hidden_dim]
        return frame_tokens + t_embed
```
Strategy 2: Temporal Token Pooling
Reduce the total token count by pooling across frames:
```python
import torch


class TemporalPooling:
    def __init__(self, pool_size: int = 4):
        self.pool_size = pool_size  # Pool every N frames

    def pool(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        """Pool temporal tokens to reduce sequence length.

        Args:
            frame_tokens: [num_frames, tokens_per_frame, hidden_dim]
        Returns:
            [num_frames // pool_size, tokens_per_frame, hidden_dim]
        """
        num_frames, tpf, dim = frame_tokens.shape

        # Drop trailing frames so num_frames divides evenly by pool_size
        usable = num_frames - num_frames % self.pool_size
        frame_tokens = frame_tokens[:usable]

        # Reshape to group frames
        grouped = frame_tokens.view(
            usable // self.pool_size, self.pool_size, tpf, dim
        )
        # Average pool across the temporal dimension
        pooled = grouped.mean(dim=1)
        return pooled
```
With pool_size=4 on a 32-frame video, the token count drops from 8,192 to 2,048 — a 4x reduction with minimal quality loss for slow-moving video content. Frame counts that are not a multiple of pool_size lose the trailing remainder frames.
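The averaging itself is simple enough to show in plain Python. A sketch of mean-pooling groups of frames with per-frame tokens as nested lists (illustrative only; the real path stays in torch):

```python
def temporal_pool(frame_tokens: list, pool_size: int = 4) -> list:
    """Average every pool_size consecutive frames.

    frame_tokens: [num_frames][tokens_per_frame][hidden_dim] nested lists.
    Trailing frames that don't fill a group are dropped.
    """
    usable = len(frame_tokens) - len(frame_tokens) % pool_size
    pooled = []
    for start in range(0, usable, pool_size):
        group = frame_tokens[start:start + pool_size]
        frame = [
            [sum(g[t][d] for g in group) / pool_size
             for d in range(len(group[0][t]))]
            for t in range(len(group[0]))
        ]
        pooled.append(frame)
    return pooled

# 8 frames, 1 token each, hidden dim 1: values 0..7 pool to 1.5 and 5.5
pooled = temporal_pool([[[float(i)]] for i in range(8)], pool_size=4)
```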
Audio Processing Pipeline
Audio follows a different encoding path through spectrogram extraction and a specialized audio encoder.
Spectrogram Extraction
```python
import torch
import torchaudio


class AudioProcessor:
    def __init__(self, config):
        self.sample_rate = config.get("sample_rate", 16000)
        self.n_fft = config.get("n_fft", 400)  # 25ms window at 16kHz
        self.hop_length = config.get("hop_length", 160)  # 10ms stride
        self.n_mels = config.get("n_mels", 80)  # Mel filterbank bins
        self.max_duration = config.get("max_duration", 30.0)  # seconds

    def process(self, audio_path: str) -> torch.Tensor:
        """Convert audio file to log-mel spectrogram."""
        waveform, sr = torchaudio.load(audio_path)
        if sr != self.sample_rate:
            waveform = torchaudio.functional.resample(
                waveform, sr, self.sample_rate
            )

        # Downmix multi-channel audio to mono
        waveform = waveform.mean(dim=0, keepdim=True)

        # Truncate to max duration
        max_samples = int(self.max_duration * self.sample_rate)
        waveform = waveform[:, :max_samples]

        # Compute mel spectrogram
        mel_transform = torchaudio.transforms.MelSpectrogram(
            sample_rate=self.sample_rate,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            n_mels=self.n_mels
        )
        mel_spec = mel_transform(waveform)  # [1, n_mels, time_steps]

        # Log scale
        log_mel = torch.log(mel_spec + 1e-6)
        return log_mel.squeeze(0)  # [n_mels, time_steps]
```
For 30 seconds of audio at 16 kHz with hop_length=160:
- Time steps: 30 x 16,000 / 160 = 3,000
- Spectrogram shape: [80, 3,000]
Audio Encoder
The audio encoder (typically Whisper-style) processes the spectrogram into tokens:
```python
import torch


class AudioEncoder:
    def __init__(self, encoder_model, projection):
        self.encoder = encoder_model  # Whisper encoder
        self.projection = projection

    def encode(self, log_mel: torch.Tensor) -> torch.Tensor:
        """Encode log-mel spectrogram to token embeddings.

        Args:
            log_mel: [n_mels, time_steps]
        Returns:
            tokens: [num_audio_tokens, hidden_dim]
        """
        # Whisper encoder expects [batch, n_mels, time_steps]
        features = self.encoder(log_mel.unsqueeze(0))
        # Output: [1, time_steps // 2, encoder_dim] (2x downsampling)

        projected = self.projection(features.squeeze(0))
        # [num_audio_tokens, hidden_dim]
        return projected
```
Audio Token Count by Duration
| Duration (s) | Spectrogram Frames | Encoder Downsample | Audio Tokens | Encoder Time (ms) |
|---|---|---|---|---|
| 5 | 500 | 2x | 250 | 8.2 |
| 10 | 1,000 | 2x | 500 | 14.5 |
| 30 | 3,000 | 2x | 1,500 | 38.1 |
| 60 | 6,000 | 2x | 3,000 | 72.4 |
| 120 | 12,000 | 2x | 6,000 | 141.8 |
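The Audio Tokens column is just sample-rate arithmetic: one spectrogram frame per hop, then the encoder halves the length. A sketch (function name is illustrative):

```python
def audio_token_count(duration_s: float, sample_rate: int = 16000,
                      hop_length: int = 160, downsample: int = 2) -> int:
    # One spectrogram frame per hop_length samples,
    # then 2x temporal downsampling inside the encoder
    spec_frames = int(duration_s * sample_rate) // hop_length
    return spec_frames // downsample

# audio_token_count(30) -> 1500, matching the 30 s row above
```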
Token Interleaving
Multi-modal inputs produce a mixed sequence of text tokens, visual tokens, and audio tokens. The LLM must know which tokens are which:
```python
import torch


class MultiModalTokenInterleaver:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        # Special tokens for modality boundaries
        # (add_special_tokens=False avoids a stray BOS token)
        self.VIDEO_START = tokenizer.encode("<video>", add_special_tokens=False)[0]
        self.VIDEO_END = tokenizer.encode("</video>", add_special_tokens=False)[0]
        self.FRAME_SEP = tokenizer.encode("<frame>", add_special_tokens=False)[0]
        self.AUDIO_START = tokenizer.encode("<audio>", add_special_tokens=False)[0]
        self.AUDIO_END = tokenizer.encode("</audio>", add_special_tokens=False)[0]

    def build_input_sequence(self, text_tokens: list,
                             video_tokens: torch.Tensor,
                             audio_tokens: torch.Tensor,
                             insertion_points: dict) -> dict:
        """Build interleaved token sequence.

        Returns dict with:
        - input_ids: combined token IDs
        - input_embeds: precomputed embeddings for non-text tokens
        - modality_mask: which positions are text vs visual vs audio
        """
        result_ids = []
        result_embeds = []
        modality_mask = []

        for i, token_id in enumerate(text_tokens):
            if i == insertion_points.get("video"):
                # Insert video tokens
                result_ids.append(self.VIDEO_START)
                modality_mask.append("text")
                for frame_idx in range(video_tokens.shape[0]):
                    result_ids.append(self.FRAME_SEP)
                    modality_mask.append("text")
                    for tok in range(video_tokens.shape[1]):
                        result_ids.append(-1)  # Placeholder
                        result_embeds.append(
                            video_tokens[frame_idx, tok]
                        )
                        modality_mask.append("video")
                result_ids.append(self.VIDEO_END)
                modality_mask.append("text")
            elif i == insertion_points.get("audio"):
                result_ids.append(self.AUDIO_START)
                modality_mask.append("text")
                for tok in range(audio_tokens.shape[0]):
                    result_ids.append(-1)
                    result_embeds.append(audio_tokens[tok])
                    modality_mask.append("audio")
                result_ids.append(self.AUDIO_END)
                modality_mask.append("text")

            result_ids.append(token_id)
            modality_mask.append("text")

        return {
            "input_ids": result_ids,
            "input_embeds": torch.stack(result_embeds) if result_embeds else None,
            "modality_mask": modality_mask
        }
```
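The final sequence length follows from the interleaver's structure: each video contributes `<video>`/`</video>` plus one `<frame>` separator per frame, and audio contributes `<audio>`/`</audio>`. A sketch of the length formula (function name is illustrative):

```python
def interleaved_length(text_len: int, num_frames: int,
                       tokens_per_frame: int, audio_tokens: int) -> int:
    # <video> + </video> plus one <frame> separator per frame
    video = 2 + num_frames * (1 + tokens_per_frame) if num_frames else 0
    # <audio> + </audio>
    audio = 2 + audio_tokens if audio_tokens else 0
    return text_len + video + audio

# 512 text tokens + 30 frames x 256 tokens + 1,500 audio tokens:
# the delimiters add 34 tokens on top of the 9,692 raw tokens.
```

This overhead is small relative to the media tokens themselves, but it must be counted when checking a request against the model's context limit.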
Batching Heterogeneous Modalities
The batching challenge: a batch might contain text-only requests, image requests, video requests, and audio requests simultaneously. Each has different context lengths and encoder requirements.
```python
class MultiModalBatchScheduler:
    def __init__(self, max_batch_tokens: int, max_encoder_batch: int):
        self.max_batch_tokens = max_batch_tokens
        self.max_encoder_batch = max_encoder_batch

    def schedule_batch(self, waiting_requests: list) -> dict:
        """Schedule a batch respecting multi-modal constraints."""
        batch_text = []
        batch_encoder = []
        total_tokens = 0
        encoder_items = 0

        for req in waiting_requests:
            req_tokens = req.text_tokens + req.media_tokens

            # Check token budget
            if total_tokens + req_tokens > self.max_batch_tokens:
                break

            # Check encoder budget
            if req.has_media and not req.media_encoded:
                if encoder_items + 1 > self.max_encoder_batch:
                    continue  # Skip, try text-only requests
                encoder_items += 1
                batch_encoder.append(req)

            batch_text.append(req)
            total_tokens += req_tokens

        return {
            "decode_batch": batch_text,
            "encoder_batch": batch_encoder,
            "total_tokens": total_tokens
        }
```
Encoder forward passes (vision encoder, audio encoder) are expensive and run only during prefill. Once a request’s media is encoded, the resulting tokens are cached and the encoder is not needed during decode steps. The scheduler separates encoder batch (requests needing encoding) from decode batch (requests generating tokens).
Memory Spikes During Encoding
Vision and audio encoders temporarily consume significant GPU memory:
```python
# Memory analysis for video encoding
# ViT-L/14 encoder: ~0.3 GB model weights
#
# Forward pass activation memory for 30 frames at 224x224:
#   Input: 30 * 3 * 224 * 224 * 2 bytes = 8.6 MB
#   Intermediate activations: ~30 * 150 MB = 4.5 GB
#   Output tokens: 30 * 256 * 1024 * 2 bytes = 15 MB
#
# Peak transient memory: ~4.5 GB
# This spike must fit within the GPU memory budget
# alongside model weights and KV cache
```
vLLM reserves headroom in the memory budget for encoder spikes by reducing the KV cache allocation when multi-modal serving is enabled:
```bash
# Reduce KV cache to leave room for encoder activations
python -m vllm.entrypoints.openai.api_server \
    --model llava-hf/llava-v1.6-34b-hf \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 32 \
    --limit-mm-per-prompt image=4,video=1
```
KV Cache Impact
Multi-modal tokens consume KV cache just like text tokens. A video with 7,680 visual tokens consumes the same cache as a 7,680-token text prompt:
```python
# KV cache consumption for multi-modal request
# Llama 70B, block_size=16, 5.24 MB per block
#
# Text-only request (512 tokens):
#   Blocks: 512 / 16 = 32 blocks
#   Cache: 32 * 5.24 MB = 167.7 MB
#
# Video request (512 text + 7,680 video tokens):
#   Total tokens: 8,192
#   Blocks: 8,192 / 16 = 512 blocks
#   Cache: 512 * 5.24 MB = 2,682.9 MB = 2.68 GB
#
# One video request uses 16x more KV cache than text-only
```
KV Cache Consumption by Request Type — Llama 70B
| Request Type | Total Tokens | Blocks | KV Cache (MB) | vs Text-Only |
|---|---|---|---|---|
| Text (512 tok) | 512 | 32 | 168 | 1.0x |
| Image (512 + 256) | 768 | 48 | 252 | 1.5x |
| Video 10s (512 + 2560) | 3,072 | 192 | 1,006 | 6.0x |
| Video 30s (512 + 7680) | 8,192 | 512 | 2,683 | 16.0x |
| Audio 30s (512 + 1500) | 2,012 | 126 | 660 | 3.9x |
| Video+Audio (512 + 7680 + 1500) | 9,692 | 606 | 3,175 | 18.9x |
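Every row in the table reduces to one ceil-divide over the block size. A sketch using the 5.24 MB/block figure from the analysis above:

```python
import math

BLOCK_SIZE = 16        # tokens per KV cache block
MB_PER_BLOCK = 5.24    # Llama 70B, fp16, from the analysis above

def kv_cache_mb(total_tokens: int) -> float:
    # Partially filled blocks still occupy a whole block
    blocks = math.ceil(total_tokens / BLOCK_SIZE)
    return blocks * MB_PER_BLOCK

# kv_cache_mb(8192) is ~2.68 GB, the 30 s video row;
# kv_cache_mb(2012) is ~660 MB, the 30 s audio row.
```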
Encoder Caching
When multiple requests reference the same video or audio file, vLLM can cache encoder outputs:
```python
import time

import torch


class EncoderCache:
    def __init__(self, max_size_mb: int = 2048):
        self.cache = {}  # hash -> (tokens, last_access, size_bytes)
        self.max_size = max_size_mb * 1024 * 1024
        self.current_size = 0

    def get_or_encode(self, media_hash: str,
                      encoder: callable,
                      media_data: torch.Tensor) -> torch.Tensor:
        if media_hash in self.cache:
            tokens, _, size = self.cache[media_hash]
            self.cache[media_hash] = (tokens, time.time(), size)
            return tokens

        tokens = encoder(media_data)
        size = tokens.numel() * tokens.element_size()

        # Evict if necessary (guard against an empty cache)
        while self.cache and self.current_size + size > self.max_size:
            self._evict_lru()

        self.cache[media_hash] = (tokens, time.time(), size)
        self.current_size += size
        return tokens

    def _evict_lru(self):
        oldest_key = min(self.cache, key=lambda k: self.cache[k][1])
        _, _, size = self.cache.pop(oldest_key)
        self.current_size -= size
```
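The same eviction policy can be expressed more compactly with an OrderedDict, which keeps recency order without timestamps. A simplified stdlib sketch of the idea (not vLLM's implementation; sizes are in bytes):

```python
from collections import OrderedDict


class SimpleLRUCache:
    """Size-bounded LRU keyed by media hash. Assumes each key
    is inserted at most once (no in-place size updates)."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self.current_size = 0
        self._entries = OrderedDict()  # hash -> (value, size)

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark most recently used
        return self._entries[key][0]

    def put(self, key, value, size):
        # Evict least recently used entries until the new one fits
        while self._entries and self.current_size + size > self.max_size:
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self.current_size -= evicted_size
        self._entries[key] = (value, size)
        self.current_size += size


cache = SimpleLRUCache(max_size=100)
cache.put("a", "tok_a", 60)
cache.put("b", "tok_b", 40)
cache.get("a")               # touch "a" so "b" becomes LRU
cache.put("c", "tok_c", 30)  # evicts "b"
```

Hashing the raw media bytes as the key makes the cache robust to the same file arriving under different URLs or request IDs.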
Performance Benchmarks
Multi-Modal Request Throughput — Llava 34B, 4xA100
| Workload Mix | Requests/min | Output tok/s | Avg TTFT (ms) | GPU Mem Used (GB) |
|---|---|---|---|---|
| 100% text | 142 | 4,580 | 48 | 68 |
| 80% text + 20% image | 118 | 3,810 | 85 | 72 |
| 50% text + 50% image | 89 | 2,870 | 142 | 74 |
| 80% text + 20% video (10s) | 72 | 2,320 | 245 | 76 |
| 50% text + 50% video (10s) | 38 | 1,225 | 520 | 78 |
| Mixed: text+image+video+audio | 55 | 1,774 | 310 | 76 |
Video-heavy workloads reduce throughput by 3-4x compared to text-only due to the large token counts and encoder overhead.
Optimization Strategies
Dynamic Resolution Scaling
Reduce frame resolution based on current load:
```python
class AdaptiveVideoProcessor:
    def __init__(self):
        self.resolution_tiers = [
            (448, 448),  # High quality
            (336, 336),  # Medium
            (224, 224),  # Low (default)
        ]

    def select_resolution(self, gpu_memory_pressure: float,
                          batch_size: int) -> tuple:
        if gpu_memory_pressure > 0.9 or batch_size > 64:
            return self.resolution_tiers[2]  # Low
        elif gpu_memory_pressure > 0.8 or batch_size > 32:
            return self.resolution_tiers[1]  # Medium
        else:
            return self.resolution_tiers[0]  # High
```
Temporal Stride Adaptation
Skip more frames when the system is under load:
```python
def adaptive_frame_sampling(video_duration: float,
                            current_load: float) -> int:
    """Reduce frames when system is loaded."""
    base_fps = 1.0
    if current_load > 0.8:
        effective_fps = 0.5  # 1 frame every 2 seconds
    elif current_load > 0.6:
        effective_fps = 0.75
    else:
        effective_fps = base_fps

    num_frames = int(video_duration * effective_fps)
    return max(4, min(num_frames, 32))
```
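Restating the thresholds above as one expression makes the behavior easy to verify at a glance (`frames_for` is an illustrative helper with the same logic):

```python
def frames_for(duration_s: float, load: float) -> int:
    # Same thresholds and clamps as adaptive_frame_sampling above
    fps = 0.5 if load > 0.8 else 0.75 if load > 0.6 else 1.0
    return max(4, min(int(duration_s * fps), 32))

# frames_for(30, 0.9) -> 15  (heavy load: one frame every 2 s)
# frames_for(30, 0.3) -> 30  (light load: full 1 FPS)
# frames_for(60, 0.3) -> 32  (capped at max_frames)
```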
Summary
vLLM v1 handles video and audio modalities through dedicated encoder pipelines that produce tokens interleaved with text. Video processing involves frame sampling (1 FPS default, capped at 32 frames), per-frame ViT encoding (256-1,024 tokens per frame), and optional temporal pooling for token reduction. Audio processing converts waveforms to log-mel spectrograms and encodes them through a Whisper-style encoder with 2x temporal downsampling. The key challenge is KV cache consumption: a 30-second video generates 7,680+ visual tokens, consuming 16x more cache than a text-only request. The scheduler manages this through separate encoder and decode batches, dynamic resolution scaling, and encoder output caching. Production deployments should limit video frame counts, use the lowest acceptable resolution, and provision 3-4x more KV cache capacity than text-only workloads require.