A 30-second video clip at 1 FPS generates 30 frames, each producing 300+ visual tokens — 9,000 tokens of visual context before the user has typed a single word. Audio at 16 kHz for the same 30 seconds yields 480,000 raw samples, compressed into hundreds of spectrogram tokens. These numbers dwarf typical text prompts by 10-30x, and they create batching challenges that do not exist in text-only serving: variable-length multi-modal contexts that can exceed 10,000 tokens, encoder forward passes that temporarily spike memory usage by 5-10 GB, and heterogeneous batch compositions where one request has a 500-token text prompt and another has a 9,000-token video prompt.
This post covers the temporal encoding pipelines for video and audio, the token interleaving strategy, batching under heterogeneous modality mixes, and the KV cache implications of large multi-modal contexts.
Video Processing Pipeline
Video input flows through three stages: frame sampling, per-frame encoding, and temporal aggregation.
Frame Sampling
Not every video frame carries unique information. vLLM samples frames at a configurable rate:
```python
import torch


class VideoProcessor:
    def __init__(self, config):
        self.max_frames = config.get("max_frames", 32)
        self.fps = config.get("fps", 1.0)  # Sample 1 frame per second
        self.min_frames = config.get("min_frames", 4)

    def sample_frames(self, video_path: str) -> list:
        """Sample frames from video at configured FPS."""
        import decord
        vr = decord.VideoReader(video_path)
        total_frames = len(vr)
        video_fps = vr.get_avg_fps()
        duration = total_frames / video_fps

        # Calculate number of frames to sample
        num_samples = int(duration * self.fps)
        num_samples = max(self.min_frames,
                          min(num_samples, self.max_frames))

        # Uniform sampling
        indices = torch.linspace(0, total_frames - 1, num_samples).long()
        frames = vr.get_batch(indices.tolist()).asnumpy()
        return frames  # Shape: [num_frames, H, W, 3]
```
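The index arithmetic behind the uniform sampling step can be checked without torch or decord. A minimal stdlib sketch of the same linspace-and-truncate logic (`uniform_frame_indices` is an illustrative helper, not vLLM API):

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list:
    """Evenly spaced frame indices from 0 to total_frames - 1,
    truncated to ints like torch.linspace(...).long()."""
    if num_samples <= 1:
        return [0]
    step = (total_frames - 1) / (num_samples - 1)
    return [int(i * step) for i in range(num_samples)]

# A 30 s clip at 30 FPS has 900 frames; sampling 30 of them
# always includes the first and last frame.
indices = uniform_frame_indices(900, 30)
```

Because the endpoints are pinned, the stride between samples grows with clip length while coverage of the full duration is preserved.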
Per-Frame Encoding
Each frame passes through the vision encoder (typically a ViT) to produce visual tokens:
```python
import torch


class VideoEncoder:
    def __init__(self, vision_encoder, projection):
        self.vision_encoder = vision_encoder  # e.g., a CLIP-style ViT
        self.projection = projection  # Linear layer to LLM hidden dim

    def encode_frames(self, frames: torch.Tensor) -> torch.Tensor:
        """Encode video frames into token embeddings.

        Args:
            frames: [num_frames, 3, H, W] preprocessed frames
        Returns:
            tokens: [num_frames, tokens_per_frame, hidden_dim]
        """
        # Encode all frames in one batched forward pass
        # ViT outputs: [num_frames, num_patches + 1, encoder_dim]
        # num_patches = (H // patch_size) * (W // patch_size)
        frame_features = self.vision_encoder(frames)

        # Remove CLS token, keep patch tokens
        # (encoders without a CLS token, e.g. SigLIP, skip this slice)
        patch_features = frame_features[:, 1:, :]
        # Shape: [num_frames, num_patches, encoder_dim]

        # Project to LLM hidden dimension
        projected = self.projection(patch_features)
        # Shape: [num_frames, num_patches, hidden_dim]
        return projected
```
For a ViT with patch size 14 and input resolution 224x224:
- Patches per frame: (224 / 14) x (224 / 14) = 16 x 16 = 256
- Tokens per frame: 256
- For 30 frames: 30 x 256 = 7,680 visual tokens
Visual Token Count by Video Configuration
| Duration (s) | FPS | Frames | Resolution | Patch Size | Tokens/Frame | Total Tokens |
|---|---|---|---|---|---|---|
| 10 | 1 | 10 | 224x224 | 14 | 256 | 2,560 |
| 30 | 1 | 30 | 224x224 | 14 | 256 | 7,680 |
| 60 | 1 | 32 (capped) | 224x224 | 14 | 256 | 8,192 |
| 30 | 2 | 32 (capped) | 336x336 | 14 | 576 | 18,432 |
| 30 | 1 | 30 | 448x448 | 14 | 1,024 | 30,720 |
At 448x448 resolution with 30 frames, the visual token count reaches 30,720 — exceeding the context length of many models. vLLM applies token pooling or dynamic resolution reduction to stay within the model’s maximum context window. Always set max_frames conservatively for production.
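The token counts in the table follow directly from the patch arithmetic and the frame cap. A quick sketch (function names are illustrative, not vLLM API):

```python
def tokens_per_frame(resolution: int, patch_size: int) -> int:
    # Square input: one token per patch, CLS token dropped
    return (resolution // patch_size) ** 2

def total_visual_tokens(duration_s: float, fps: float, resolution: int,
                        patch_size: int = 14, max_frames: int = 32,
                        min_frames: int = 4) -> int:
    frames = max(min_frames, min(int(duration_s * fps), max_frames))
    return frames * tokens_per_frame(resolution, patch_size)

# Matches the rows above:
# total_visual_tokens(30, 1, 224) -> 7,680
# total_visual_tokens(60, 1, 224) -> 8,192 (capped at 32 frames)
# total_visual_tokens(30, 1, 448) -> 30,720
```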
Temporal Aggregation
Raw per-frame tokens lack temporal information. Models use different strategies to inject temporal awareness:
Strategy 1: Temporal Position Embeddings
Add learned temporal embeddings to each frame’s tokens:
```python
import torch


class TemporalPositionEncoder:
    def __init__(self, max_frames: int, hidden_dim: int):
        self.temporal_embed = torch.nn.Embedding(max_frames, hidden_dim)

    def encode(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        """Add temporal position to frame tokens.

        Args:
            frame_tokens: [num_frames, tokens_per_frame, hidden_dim]
        Returns:
            [num_frames, tokens_per_frame, hidden_dim]
        """
        num_frames = frame_tokens.shape[0]
        temporal_ids = torch.arange(num_frames, device=frame_tokens.device)
        # Broadcast temporal embedding across all tokens in a frame
        t_embed = self.temporal_embed(temporal_ids)  # [num_frames, hidden_dim]
        t_embed = t_embed.unsqueeze(1)  # [num_frames, 1, hidden_dim]
        return frame_tokens + t_embed
```
Strategy 2: Temporal Token Pooling
Reduce the total token count by pooling across frames:
```python
import torch


class TemporalPooling:
    def __init__(self, pool_size: int = 4):
        self.pool_size = pool_size  # Pool every N frames

    def pool(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        """Pool temporal tokens to reduce sequence length.

        Args:
            frame_tokens: [num_frames, tokens_per_frame, hidden_dim]
        Returns:
            [num_frames // pool_size, tokens_per_frame, hidden_dim]
        """
        num_frames, tpf, dim = frame_tokens.shape

        # Drop trailing frames so num_frames divides evenly by pool_size
        usable = num_frames - num_frames % self.pool_size
        frame_tokens = frame_tokens[:usable]

        # Reshape to group frames
        grouped = frame_tokens.view(
            usable // self.pool_size, self.pool_size, tpf, dim
        )
        # Average pool across the temporal dimension
        pooled = grouped.mean(dim=1)
        return pooled
```
With pool_size=4 on a 32-frame video, the token count drops from 8,192 to 2,048 — a 4x reduction with minimal quality loss for slow-moving video content. Frame counts that are not a multiple of pool_size lose the trailing remainder frames.
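The averaging itself is simple enough to show in plain Python. A sketch of mean-pooling groups of frames with per-frame tokens as nested lists (illustrative only; the real path stays in torch):

```python
def temporal_pool(frame_tokens: list, pool_size: int = 4) -> list:
    """Average every pool_size consecutive frames.

    frame_tokens: [num_frames][tokens_per_frame][hidden_dim] nested lists.
    Trailing frames that don't fill a group are dropped.
    """
    usable = len(frame_tokens) - len(frame_tokens) % pool_size
    pooled = []
    for start in range(0, usable, pool_size):
        group = frame_tokens[start:start + pool_size]
        frame = [
            [sum(g[t][d] for g in group) / pool_size
             for d in range(len(group[0][t]))]
            for t in range(len(group[0]))
        ]
        pooled.append(frame)
    return pooled

# 8 frames, 1 token each, hidden dim 1: values 0..7 pool to 1.5 and 5.5
pooled = temporal_pool([[[float(i)]] for i in range(8)], pool_size=4)
```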
Audio Processing Pipeline
Audio follows a different encoding path through spectrogram extraction and a specialized audio encoder.
Spectrogram Extraction
```python
import torch
import torchaudio


class AudioProcessor:
    def __init__(self, config):
        self.sample_rate = config.get("sample_rate", 16000)
        self.n_fft = config.get("n_fft", 400)  # 25ms window at 16kHz
        self.hop_length = config.get("hop_length", 160)  # 10ms stride
        self.n_mels = config.get("n_mels", 80)  # Mel filterbank bins
        self.max_duration = config.get("max_duration", 30.0)  # seconds

    def process(self, audio_path: str) -> torch.Tensor:
        """Convert audio file to log-mel spectrogram."""
        waveform, sr = torchaudio.load(audio_path)
        if sr != self.sample_rate:
            waveform = torchaudio.functional.resample(
                waveform, sr, self.sample_rate
            )

        # Downmix multi-channel audio to mono
        waveform = waveform.mean(dim=0, keepdim=True)

        # Truncate to max duration
        max_samples = int(self.max_duration * self.sample_rate)
        waveform = waveform[:, :max_samples]

        # Compute mel spectrogram
        mel_transform = torchaudio.transforms.MelSpectrogram(
            sample_rate=self.sample_rate,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            n_mels=self.n_mels
        )
        mel_spec = mel_transform(waveform)  # [1, n_mels, time_steps]

        # Log scale
        log_mel = torch.log(mel_spec + 1e-6)
        return log_mel.squeeze(0)  # [n_mels, time_steps]
```
For 30 seconds of audio at 16 kHz with hop_length=160:
- Time steps: 30 x 16,000 / 160 = 3,000
- Spectrogram shape: [80, 3,000]
Audio Encoder
The audio encoder (typically Whisper-style) processes the spectrogram into tokens:
```python
import torch


class AudioEncoder:
    def __init__(self, encoder_model, projection):
        self.encoder = encoder_model  # Whisper encoder
        self.projection = projection

    def encode(self, log_mel: torch.Tensor) -> torch.Tensor:
        """Encode log-mel spectrogram to token embeddings.

        Args:
            log_mel: [n_mels, time_steps]
        Returns:
            tokens: [num_audio_tokens, hidden_dim]
        """
        # Whisper encoder expects [batch, n_mels, time_steps]
        features = self.encoder(log_mel.unsqueeze(0))
        # Output: [1, time_steps // 2, encoder_dim] (2x downsampling)

        projected = self.projection(features.squeeze(0))
        # [num_audio_tokens, hidden_dim]
        return projected
```
Audio Token Count by Duration
| Duration (s) | Spectrogram Frames | Encoder Downsample | Audio Tokens | Encoder Time (ms) |
|---|---|---|---|---|
| 5 | 500 | 2x | 250 | 8.2 |
| 10 | 1,000 | 2x | 500 | 14.5 |
| 30 | 3,000 | 2x | 1,500 | 38.1 |
| 60 | 6,000 | 2x | 3,000 | 72.4 |
| 120 | 12,000 | 2x | 6,000 | 141.8 |
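The Audio Tokens column is just sample-rate arithmetic: one spectrogram frame per hop, then the encoder halves the length. A sketch (function name is illustrative):

```python
def audio_token_count(duration_s: float, sample_rate: int = 16000,
                      hop_length: int = 160, downsample: int = 2) -> int:
    # One spectrogram frame per hop_length samples,
    # then 2x temporal downsampling inside the encoder
    spec_frames = int(duration_s * sample_rate) // hop_length
    return spec_frames // downsample

# audio_token_count(30) -> 1500, matching the 30 s row above
```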
Token Interleaving
Multi-modal inputs produce a mixed sequence of text tokens, visual tokens, and audio tokens. The LLM must know which tokens are which:
```python
import torch


class MultiModalTokenInterleaver:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        # Special tokens for modality boundaries
        # (add_special_tokens=False avoids a stray BOS token)
        self.VIDEO_START = tokenizer.encode("<video>", add_special_tokens=False)[0]
        self.VIDEO_END = tokenizer.encode("</video>", add_special_tokens=False)[0]
        self.FRAME_SEP = tokenizer.encode("<frame>", add_special_tokens=False)[0]
        self.AUDIO_START = tokenizer.encode("<audio>", add_special_tokens=False)[0]
        self.AUDIO_END = tokenizer.encode("</audio>", add_special_tokens=False)[0]

    def build_input_sequence(self, text_tokens: list,
                             video_tokens: torch.Tensor,
                             audio_tokens: torch.Tensor,
                             insertion_points: dict) -> dict:
        """Build interleaved token sequence.

        Returns dict with:
        - input_ids: combined token IDs
        - input_embeds: precomputed embeddings for non-text tokens
        - modality_mask: which positions are text vs visual vs audio
        """
        result_ids = []
        result_embeds = []
        modality_mask = []

        for i, token_id in enumerate(text_tokens):
            if i == insertion_points.get("video"):
                # Insert video tokens
                result_ids.append(self.VIDEO_START)
                modality_mask.append("text")
                for frame_idx in range(video_tokens.shape[0]):
                    result_ids.append(self.FRAME_SEP)
                    modality_mask.append("text")
                    for tok in range(video_tokens.shape[1]):
                        result_ids.append(-1)  # Placeholder
                        result_embeds.append(
                            video_tokens[frame_idx, tok]
                        )
                        modality_mask.append("video")
                result_ids.append(self.VIDEO_END)
                modality_mask.append("text")
            elif i == insertion_points.get("audio"):
                result_ids.append(self.AUDIO_START)
                modality_mask.append("text")
                for tok in range(audio_tokens.shape[0]):
                    result_ids.append(-1)
                    result_embeds.append(audio_tokens[tok])
                    modality_mask.append("audio")
                result_ids.append(self.AUDIO_END)
                modality_mask.append("text")

            result_ids.append(token_id)
            modality_mask.append("text")

        return {
            "input_ids": result_ids,
            "input_embeds": torch.stack(result_embeds) if result_embeds else None,
            "modality_mask": modality_mask
        }
```
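The final sequence length follows from the interleaver's structure: each video contributes `<video>`/`</video>` plus one `<frame>` separator per frame, and audio contributes `<audio>`/`</audio>`. A sketch of the length formula (function name is illustrative):

```python
def interleaved_length(text_len: int, num_frames: int,
                       tokens_per_frame: int, audio_tokens: int) -> int:
    # <video> + </video> plus one <frame> separator per frame
    video = 2 + num_frames * (1 + tokens_per_frame) if num_frames else 0
    # <audio> + </audio>
    audio = 2 + audio_tokens if audio_tokens else 0
    return text_len + video + audio

# 512 text tokens + 30 frames x 256 tokens + 1,500 audio tokens:
# the delimiters add 34 tokens on top of the 9,692 raw tokens.
```

This overhead is small relative to the media tokens themselves, but it must be counted when checking a request against the model's context limit.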
Batching Heterogeneous Modalities
The batching challenge: a batch might contain text-only requests, image requests, video requests, and audio requests simultaneously. Each has different context lengths and encoder requirements.
```python
class MultiModalBatchScheduler:
    def __init__(self, max_batch_tokens: int, max_encoder_batch: int):
        self.max_batch_tokens = max_batch_tokens
        self.max_encoder_batch = max_encoder_batch

    def schedule_batch(self, waiting_requests: list) -> dict:
        """Schedule a batch respecting multi-modal constraints."""
        batch_text = []
        batch_encoder = []
        total_tokens = 0
        encoder_items = 0

        for req in waiting_requests:
            req_tokens = req.text_tokens + req.media_tokens

            # Check token budget
            if total_tokens + req_tokens > self.max_batch_tokens:
                break

            # Check encoder budget
            if req.has_media and not req.media_encoded:
                if encoder_items + 1 > self.max_encoder_batch:
                    continue  # Skip, try text-only requests
                encoder_items += 1
                batch_encoder.append(req)

            batch_text.append(req)
            total_tokens += req_tokens

        return {
            "decode_batch": batch_text,
            "encoder_batch": batch_encoder,
            "total_tokens": total_tokens
        }
```
Encoder forward passes (vision encoder, audio encoder) are expensive and run only during prefill. Once a request’s media is encoded, the resulting tokens are cached and the encoder is not needed during decode steps. The scheduler separates encoder batch (requests needing encoding) from decode batch (requests generating tokens).
Memory Spikes During Encoding
Vision and audio encoders temporarily consume significant GPU memory:
```python
# Memory analysis for video encoding
# ViT-L/14 encoder: ~0.3 GB model weights
#
# Forward pass activation memory for 30 frames at 224x224:
#   Input: 30 * 3 * 224 * 224 * 2 bytes = 8.6 MB
#   Intermediate activations: ~30 * 150 MB = 4.5 GB
#   Output tokens: 30 * 256 * 1024 * 2 bytes = 15 MB
#
# Peak transient memory: ~4.5 GB
# This spike must fit within the GPU memory budget
# alongside model weights and KV cache
```
vLLM reserves headroom in the memory budget for encoder spikes by reducing the KV cache allocation when multi-modal serving is enabled:
```bash
# Reduce KV cache to leave room for encoder activations
python -m vllm.entrypoints.openai.api_server \
    --model llava-hf/llava-v1.6-34b-hf \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 32 \
    --limit-mm-per-prompt image=4,video=1
```
KV Cache Impact
Multi-modal tokens consume KV cache just like text tokens. A video with 7,680 visual tokens consumes the same cache as a 7,680-token text prompt:
```python
# KV cache consumption for multi-modal request
# Llama 70B, block_size=16, 5.24 MB per block
#
# Text-only request (512 tokens):
#   Blocks: 512 / 16 = 32 blocks
#   Cache: 32 * 5.24 MB = 167.7 MB
#
# Video request (512 text + 7,680 video tokens):
#   Total tokens: 8,192
#   Blocks: 8,192 / 16 = 512 blocks
#   Cache: 512 * 5.24 MB = 2,682.9 MB = 2.68 GB
#
# One video request uses 16x more KV cache than text-only
```
KV Cache Consumption by Request Type — Llama 70B
| Request Type | Total Tokens | Blocks | KV Cache (MB) | vs Text-Only |
|---|---|---|---|---|
| Text (512 tok) | 512 | 32 | 168 | 1.0x |
| Image (512 + 256) | 768 | 48 | 252 | 1.5x |
| Video 10s (512 + 2560) | 3,072 | 192 | 1,006 | 6.0x |
| Video 30s (512 + 7680) | 8,192 | 512 | 2,683 | 16.0x |
| Audio 30s (512 + 1500) | 2,012 | 126 | 660 | 3.9x |
| Video+Audio (512 + 7680 + 1500) | 9,692 | 606 | 3,175 | 18.9x |
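Every row in the table reduces to one ceil-divide over the block size. A sketch using the 5.24 MB/block figure from the analysis above:

```python
import math

BLOCK_SIZE = 16        # tokens per KV cache block
MB_PER_BLOCK = 5.24    # Llama 70B, fp16, from the analysis above

def kv_cache_mb(total_tokens: int) -> float:
    # Partially filled blocks still occupy a whole block
    blocks = math.ceil(total_tokens / BLOCK_SIZE)
    return blocks * MB_PER_BLOCK

# kv_cache_mb(8192) is ~2.68 GB, the 30 s video row;
# kv_cache_mb(2012) is ~660 MB, the 30 s audio row.
```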
Encoder Caching
When multiple requests reference the same video or audio file, vLLM can cache encoder outputs:
```python
import time

import torch


class EncoderCache:
    def __init__(self, max_size_mb: int = 2048):
        self.cache = {}  # hash -> (tokens, last_access, size_bytes)
        self.max_size = max_size_mb * 1024 * 1024
        self.current_size = 0

    def get_or_encode(self, media_hash: str,
                      encoder: callable,
                      media_data: torch.Tensor) -> torch.Tensor:
        if media_hash in self.cache:
            tokens, _, size = self.cache[media_hash]
            self.cache[media_hash] = (tokens, time.time(), size)
            return tokens

        tokens = encoder(media_data)
        size = tokens.numel() * tokens.element_size()

        # Evict if necessary (guard against an empty cache)
        while self.cache and self.current_size + size > self.max_size:
            self._evict_lru()

        self.cache[media_hash] = (tokens, time.time(), size)
        self.current_size += size
        return tokens

    def _evict_lru(self):
        oldest_key = min(self.cache, key=lambda k: self.cache[k][1])
        _, _, size = self.cache.pop(oldest_key)
        self.current_size -= size
```
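The same eviction policy can be expressed more compactly with an OrderedDict, which keeps recency order without timestamps. A simplified stdlib sketch of the idea (not vLLM's implementation; sizes are in bytes):

```python
from collections import OrderedDict


class SimpleLRUCache:
    """Size-bounded LRU keyed by media hash. Assumes each key
    is inserted at most once (no in-place size updates)."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self.current_size = 0
        self._entries = OrderedDict()  # hash -> (value, size)

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark most recently used
        return self._entries[key][0]

    def put(self, key, value, size):
        # Evict least recently used entries until the new one fits
        while self._entries and self.current_size + size > self.max_size:
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self.current_size -= evicted_size
        self._entries[key] = (value, size)
        self.current_size += size


cache = SimpleLRUCache(max_size=100)
cache.put("a", "tok_a", 60)
cache.put("b", "tok_b", 40)
cache.get("a")               # touch "a" so "b" becomes LRU
cache.put("c", "tok_c", 30)  # evicts "b"
```

Hashing the raw media bytes as the key makes the cache robust to the same file arriving under different URLs or request IDs.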
Performance Benchmarks
Multi-Modal Request Throughput — Llava 34B, 4xA100
| Workload Mix | Requests/min | Output tok/s | Avg TTFT (ms) | GPU Mem Used (GB) |
|---|---|---|---|---|
| 100% text | 142 | 4,580 | 48 | 68 |
| 80% text + 20% image | 118 | 3,810 | 85 | 72 |
| 50% text + 50% image | 89 | 2,870 | 142 | 74 |
| 80% text + 20% video (10s) | 72 | 2,320 | 245 | 76 |
| 50% text + 50% video (10s) | 38 | 1,225 | 520 | 78 |
| Mixed: text+image+video+audio | 55 | 1,774 | 310 | 76 |
Video-heavy workloads reduce throughput by 3-4x compared to text-only due to the large token counts and encoder overhead.
Optimization Strategies
Dynamic Resolution Scaling
Reduce frame resolution based on current load:
```python
class AdaptiveVideoProcessor:
    def __init__(self):
        self.resolution_tiers = [
            (448, 448),  # High quality
            (336, 336),  # Medium
            (224, 224),  # Low (default)
        ]

    def select_resolution(self, gpu_memory_pressure: float,
                          batch_size: int) -> tuple:
        if gpu_memory_pressure > 0.9 or batch_size > 64:
            return self.resolution_tiers[2]  # Low
        elif gpu_memory_pressure > 0.8 or batch_size > 32:
            return self.resolution_tiers[1]  # Medium
        else:
            return self.resolution_tiers[0]  # High
```
Temporal Stride Adaptation
Skip more frames when the system is under load:
```python
def adaptive_frame_sampling(video_duration: float,
                            current_load: float) -> int:
    """Reduce frames when system is loaded."""
    base_fps = 1.0
    if current_load > 0.8:
        effective_fps = 0.5  # 1 frame every 2 seconds
    elif current_load > 0.6:
        effective_fps = 0.75
    else:
        effective_fps = base_fps

    num_frames = int(video_duration * effective_fps)
    return max(4, min(num_frames, 32))
```
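Restating the thresholds above as one expression makes the behavior easy to verify at a glance (`frames_for` is an illustrative helper with the same logic):

```python
def frames_for(duration_s: float, load: float) -> int:
    # Same thresholds and clamps as adaptive_frame_sampling above
    fps = 0.5 if load > 0.8 else 0.75 if load > 0.6 else 1.0
    return max(4, min(int(duration_s * fps), 32))

# frames_for(30, 0.9) -> 15  (heavy load: one frame every 2 s)
# frames_for(30, 0.3) -> 30  (light load: full 1 FPS)
# frames_for(60, 0.3) -> 32  (capped at max_frames)
```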
Summary
vLLM v1 handles video and audio modalities through dedicated encoder pipelines that produce tokens interleaved with text. Video processing involves frame sampling (1 FPS default, capped at 32 frames), per-frame ViT encoding (256-1,024 tokens per frame), and optional temporal pooling for token reduction. Audio processing converts waveforms to log-mel spectrograms and encodes them through a Whisper-style encoder with 2x temporal downsampling. The key challenge is KV cache consumption: a 30-second video generates 7,680+ visual tokens, consuming 16x more cache than a text-only request. The scheduler manages this through separate encoder and decode batches, dynamic resolution scaling, and encoder output caching. Production deployments should limit video frame counts, use the lowest acceptable resolution, and provision 3-4x more KV cache capacity than text-only workloads require.