Serving a text-only LLM is hard. Serving a vision-language model (VLM) is harder. A single 1024x1024 image at 14x14 patch size produces 5,329 visual tokens — equivalent to a 5,000-word text prompt in token count, but costing 10-50x more compute due to ViT encoding. Video is worse: 30 frames at 256 patches each = 7,680 tokens. The serving system must handle these asymmetric workloads without destroying text-only performance.
The VLM Inference Pipeline
The Token Count Problem
A text prompt of 100 words is ~130 tokens. An image at standard resolution is 576-1024 tokens (ViT-L with 14x14 patches). A high-resolution image with dynamic resolution (LLaVA-1.6 style) can be 2,880 tokens (2x2 grid of 720-token sub-images).
```python
def compute_visual_tokens(image_size, patch_size=14, pool_stride=1, add_cls=False):
    """Compute the number of visual tokens for a given image size."""
    patches_h = image_size[0] // patch_size
    patches_w = image_size[1] // patch_size
    total_patches = patches_h * patches_w
    # Some models pool patches (e.g. stride 2 reduces the count 4x)
    if pool_stride > 1:
        total_patches = total_patches // (pool_stride ** 2)
    # Some models prepend a CLS token
    return total_patches + 1 if add_cls else total_patches

# Examples (no pooling, no CLS token):
# 224x224, patch=14: 16*16 = 256 tokens
# 336x336, patch=14: 24*24 = 576 tokens
# 672x672, patch=14: 48*48 = 2304 tokens (high-res)
```
Visual Token Counts by Input Type
| Input | Resolution | Visual Tokens | Equivalent Text | ViT Encode Time (A100) |
|---|---|---|---|---|
| Single image (standard) | 336x336 | 576 | ~450 words | 8 ms |
| Single image (high-res) | 672x672 | 2,304 | ~1,800 words | 25 ms |
| 4-image grid | 4 x 336x336 | 2,304 | ~1,800 words | 32 ms (batched) |
| Video (30 frames) | 30 x 224x224 | 7,680 | ~6,000 words | 95 ms |
| Document (page scan) | 1024x1024 | 5,329 | ~4,200 words | 45 ms |
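The token counts in the table follow directly from the patch arithmetic; a minimal sanity check, assuming no patch pooling and no CLS token:

```python
def visual_tokens(height, width, patch_size=14):
    """Visual tokens = number of non-overlapping patches."""
    return (height // patch_size) * (width // patch_size)

print(visual_tokens(336, 336))       # 576  (standard image)
print(visual_tokens(672, 672))       # 2304 (high-res)
print(visual_tokens(1024, 1024))     # 5329 (document scan)
print(30 * visual_tokens(224, 224))  # 7680 (30-frame video)
```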
KV Cache Impact
Visual tokens produce KV cache entries just like text tokens. For Llama 3.2-Vision 90B with GQA-8:
A 576-token image across 80 layers: 576 tokens x 80 layers x 4,096 bytes/token/layer ≈ 189 MB of KV cache. Just for ONE image.
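The per-image figure falls out of the per-token cache size. A minimal sketch of that arithmetic, assuming 8 KV heads (GQA-8), head dimension 128, and FP16 across 80 layers:

```python
def kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, fp_bytes=2):
    """K + V per layer, summed over layers: 2 * heads * dim * bytes * layers."""
    return 2 * n_kv_heads * head_dim * fp_bytes * n_layers

print(kv_bytes_per_token(n_layers=1))         # 4096 bytes/token/layer
print(round(576 * kv_bytes_per_token() / 1e6))  # 189 MB for one 576-token image
```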
KV Cache per Request Type (Llama 3.2 Vision 90B)

A video request consumes 2.68 GB of KV cache — 16x more than a text-only request (~164 MB). On an 80 GB GPU with 50 GB reserved for KV cache, that budget fits roughly 300 concurrent text requests but only about 18 concurrent video requests. The serving system must account for this asymmetry in its scheduling and admission control.
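One way to make that asymmetry concrete is a KV-budget check at admission time. A minimal sketch; the function and constant names are my own (not from any serving framework), and the video request is assumed to total ~8,192 tokens (7,680 visual plus prompt text):

```python
TEXT_KV_GB = 500 * 80 * 4096 / 1e9    # ~0.164 GB per ~500-token text request
VIDEO_KV_GB = 8192 * 80 * 4096 / 1e9  # ~2.68 GB per 30-frame video request
KV_BUDGET_GB = 50.0                   # per-GPU KV cache budget

def can_admit(request_kv_gb, in_use_gb, budget_gb=KV_BUDGET_GB):
    """Admit a request only if its KV cache fits in the remaining budget."""
    return in_use_gb + request_kv_gb <= budget_gb

# Capacity if the budget is spent on one request type only:
print(int(KV_BUDGET_GB // TEXT_KV_GB))   # ~305 text-only requests
print(int(KV_BUDGET_GB // VIDEO_KV_GB))  # ~18 video requests
```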
Serving Architecture Patterns
Pattern 1: Integrated (Single GPU)
ViT encoder and LLM run on the same GPU. Simple but ViT encoding blocks LLM decode:
```python
def integrated_forward(model, text_tokens, images, image_positions):
    # ViT encoding (blocks decode for other requests)
    visual_tokens = model.vit(images)  # 5-50 ms
    visual_embeds = model.projection(visual_tokens)
    # Merge visual + text
    combined = merge_tokens(text_tokens, visual_embeds, image_positions)
    # Standard LLM forward
    return model.llm(combined)
```
Pattern 2: Disaggregated Encoder
Dedicated encoder GPUs process images asynchronously. LLM GPUs only handle text + pre-computed visual embeddings:
```python
# Encoder GPU (separate process/node)
async def encode_images(images):
    visual_tokens = vit_model(images)
    embeddings = projection(visual_tokens)
    return embeddings  # Send to LLM GPU via NIXL/NVLink

# LLM GPU (receives pre-computed embeddings)
async def llm_forward(text_tokens, visual_embeddings, image_positions):
    combined = merge_tokens(text_tokens, visual_embeddings, image_positions)
    return llm_model(combined)
```
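Both patterns lean on a `merge_tokens` helper that the sketches leave undefined. A common approach is to reserve placeholder positions in the embedded sequence and overwrite them with the projected visual embeddings; a minimal NumPy sketch of that idea (the exact mechanics vary by model):

```python
import numpy as np

def merge_tokens(text_embeds, visual_embeds, image_positions):
    """Overwrite placeholder rows in the embedded sequence with visual embeddings.

    text_embeds:     [seq_len, hidden] embedded prompt (with image placeholders)
    visual_embeds:   [n_visual, hidden] projected ViT outputs
    image_positions: the n_visual sequence indices reserved for the image
    """
    combined = text_embeds.copy()
    combined[image_positions] = visual_embeds
    return combined

# 6-token sequence, positions 2-4 reserved for a 3-token "image"
seq = np.zeros((6, 4))
vis = np.ones((3, 4))
merged = merge_tokens(seq, vis, [2, 3, 4])
```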
This is the E/P/D/G pattern from the vLLM v1 & Omni series.
Integrated vs Disaggregated VLM Serving
| Property | Integrated | Disaggregated |
|---|---|---|
| ViT blocking LLM decode? | Yes (decode stalls during encoding) | No (separate GPU) |
| GPU utilization | ~60% (idle during ViT for decode, idle during LLM for encode) | ~85% (each GPU specialized) |
| Complexity | Low (single process) | High (multi-node coordination) |
| Best for | Low-traffic, mixed workloads | High-traffic, image-heavy workloads |
Reviewer Agent Validation
Challenge: Compute the maximum number of concurrent requests an 80GB H100 can handle for: (a) text-only 500-token requests, (b) text + 1 image requests, given Llama 3.2 Vision 90B in FP16 (model weights ~180 GB, requires 3 GPUs for weights, KV cache budget per GPU ~50 GB).
Expected:
- (a) 500 tokens x 80 layers x 4,096 bytes/token/layer ≈ 164 MB per request. 50 GB / 164 MB ≈ 305 concurrent text requests.
- (b) 1,076 tokens x 80 layers x 4,096 bytes/token/layer ≈ 353 MB per request. 50 GB / 353 MB ≈ 141 concurrent multimodal requests.
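The expected answers can be checked mechanically; a small sketch using the challenge's numbers (500 text tokens, or 500 text + 576 image tokens):

```python
N_LAYERS = 80
BYTES_PER_TOKEN_LAYER = 4096   # GQA-8, head_dim 128, FP16: 2 * 8 * 128 * 2
KV_BUDGET_BYTES = 50e9         # per-GPU KV cache budget

def max_concurrent(tokens_per_request):
    """Floor of (KV budget) / (per-request KV cache)."""
    per_request = tokens_per_request * N_LAYERS * BYTES_PER_TOKEN_LAYER
    return int(KV_BUDGET_BYTES // per_request)

print(max_concurrent(500))   # (a) text-only: 305
print(max_concurrent(1076))  # (b) text + one 576-token image: 141
```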