Serving a text-only LLM is hard. Serving a vision-language model (VLM) is harder. A single 1024x1024 image at 14x14 patch size produces 5,329 visual tokens — equivalent to a 5,000-word text prompt in token count, but costing 10-50x more compute due to ViT encoding. Video is worse: 30 frames at 256 patches each = 7,680 tokens. The serving system must handle these asymmetric workloads without destroying text-only performance.
The VLM Inference Pipeline
The Token Count Problem
A text prompt of 100 words is ~130 tokens. An image at standard resolution is 576-1024 tokens (ViT-L with 14x14 patches). A high-resolution image with dynamic resolution (LLaVA-1.6 style) can be 2,880 tokens (2x2 grid of 720-token sub-images).
```python
def compute_visual_tokens(image_size, patch_size=14, pool_stride=1, add_cls=False):
    """Compute the number of visual tokens for a given image size."""
    patches_h = image_size[0] // patch_size
    patches_w = image_size[1] // patch_size
    total_patches = patches_h * patches_w
    # Some models pool patches (e.g. stride 2 reduces the count 4x)
    if pool_stride > 1:
        total_patches = total_patches // (pool_stride ** 2)
    # Some models prepend a CLS token
    return total_patches + 1 if add_cls else total_patches

# Examples (no pooling, no CLS token):
# 224x224, patch=14: 16*16 = 256 tokens
# 336x336, patch=14: 24*24 = 576 tokens
# 672x672, patch=14: 48*48 = 2304 tokens (high-res)
```
Visual Token Counts by Input Type
| Input | Resolution | Visual Tokens | Equivalent Text | ViT Encode Time (A100) |
|---|---|---|---|---|
| Single image (standard) | 336x336 | 576 | ~450 words | 8 ms |
| Single image (high-res) | 672x672 | 2,304 | ~1,800 words | 25 ms |
| 4-image grid | 4 x 336x336 | 2,304 | ~1,800 words | 32 ms (batched) |
| Video (30 frames) | 30 x 224x224 | 7,680 | ~6,000 words | 95 ms |
| Document (page scan) | 1024x1024 | 5,329 | ~4,200 words | 45 ms |
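The token counts in the table follow directly from the patch arithmetic; a minimal sanity check, assuming no patch pooling and no CLS token:

```python
def visual_tokens(height, width, patch_size=14):
    """Visual tokens = number of non-overlapping patches."""
    return (height // patch_size) * (width // patch_size)

print(visual_tokens(336, 336))       # 576  (standard image)
print(visual_tokens(672, 672))       # 2304 (high-res)
print(visual_tokens(1024, 1024))     # 5329 (document scan)
print(30 * visual_tokens(224, 224))  # 7680 (30-frame video)
```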
KV Cache Impact
Visual tokens produce KV cache entries just like text tokens. For Llama 3.2-Vision 90B with GQA-8:
A 576-token image across 80 layers: 576 tokens x 80 layers x 4,096 bytes/token/layer ≈ 189 MB of KV cache. Just for ONE image.
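The per-image figure falls out of the per-token cache size. A minimal sketch of that arithmetic, assuming 8 KV heads (GQA-8), head dimension 128, and FP16 across 80 layers:

```python
def kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, fp_bytes=2):
    """K + V per layer, summed over layers: 2 * heads * dim * bytes * layers."""
    return 2 * n_kv_heads * head_dim * fp_bytes * n_layers

print(kv_bytes_per_token(n_layers=1))         # 4096 bytes/token/layer
print(round(576 * kv_bytes_per_token() / 1e6))  # 189 MB for one 576-token image
```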
KV Cache per Request Type (Llama 3.2 Vision 90B)

A video request consumes 2.68 GB of KV cache — 16x more than a text-only request (~164 MB). On an 80 GB GPU with 50 GB reserved for KV cache, that budget fits roughly 300 concurrent text requests but only about 18 concurrent video requests. The serving system must account for this asymmetry in its scheduling and admission control.
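One way to make that asymmetry concrete is a KV-budget check at admission time. A minimal sketch; the function and constant names are my own (not from any serving framework), and the video request is assumed to total ~8,192 tokens (7,680 visual plus prompt text):

```python
TEXT_KV_GB = 500 * 80 * 4096 / 1e9    # ~0.164 GB per ~500-token text request
VIDEO_KV_GB = 8192 * 80 * 4096 / 1e9  # ~2.68 GB per 30-frame video request
KV_BUDGET_GB = 50.0                   # per-GPU KV cache budget

def can_admit(request_kv_gb, in_use_gb, budget_gb=KV_BUDGET_GB):
    """Admit a request only if its KV cache fits in the remaining budget."""
    return in_use_gb + request_kv_gb <= budget_gb

# Capacity if the budget is spent on one request type only:
print(int(KV_BUDGET_GB // TEXT_KV_GB))   # ~305 text-only requests
print(int(KV_BUDGET_GB // VIDEO_KV_GB))  # ~18 video requests
```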
Serving Architecture Patterns
Pattern 1: Integrated (Single GPU)
ViT encoder and LLM run on the same GPU. Simple but ViT encoding blocks LLM decode:
```python
def integrated_forward(model, text_tokens, images, image_positions):
    # ViT encoding (blocks decode for other requests)
    visual_tokens = model.vit(images)  # 5-50 ms
    visual_embeds = model.projection(visual_tokens)
    # Merge visual + text
    combined = merge_tokens(text_tokens, visual_embeds, image_positions)
    # Standard LLM forward
    return model.llm(combined)
```
Pattern 2: Disaggregated Encoder
Dedicated encoder GPUs process images asynchronously. LLM GPUs only handle text + pre-computed visual embeddings:
```python
# Encoder GPU (separate process/node)
async def encode_images(images):
    visual_tokens = vit_model(images)
    embeddings = projection(visual_tokens)
    return embeddings  # Send to LLM GPU via NIXL/NVLink

# LLM GPU (receives pre-computed embeddings)
async def llm_forward(text_tokens, visual_embeddings, image_positions):
    combined = merge_tokens(text_tokens, visual_embeddings, image_positions)
    return llm_model(combined)
```
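Both patterns lean on a `merge_tokens` helper that the sketches leave undefined. A common approach is to reserve placeholder positions in the embedded sequence and overwrite them with the projected visual embeddings; a minimal NumPy sketch of that idea (the exact mechanics vary by model):

```python
import numpy as np

def merge_tokens(text_embeds, visual_embeds, image_positions):
    """Overwrite placeholder rows in the embedded sequence with visual embeddings.

    text_embeds:     [seq_len, hidden] embedded prompt (with image placeholders)
    visual_embeds:   [n_visual, hidden] projected ViT outputs
    image_positions: the n_visual sequence indices reserved for the image
    """
    combined = text_embeds.copy()
    combined[image_positions] = visual_embeds
    return combined

# 6-token sequence, positions 2-4 reserved for a 3-token "image"
seq = np.zeros((6, 4))
vis = np.ones((3, 4))
merged = merge_tokens(seq, vis, [2, 3, 4])
```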
This is the E/P/D/G pattern from the vLLM v1 & Omni series.
Integrated vs Disaggregated VLM Serving
| Property | Integrated | Disaggregated |
|---|---|---|
| ViT blocking LLM decode? | Yes (decode stalls during encoding) | No (separate GPU) |
| GPU utilization | ~60% (idle during ViT for decode, idle during LLM for encode) | ~85% (each GPU specialized) |
| Complexity | Low (single process) | High (multi-node coordination) |
| Best for | Low-traffic, mixed workloads | High-traffic, image-heavy workloads |
Reviewer Agent Validation
Challenge: Compute the maximum number of concurrent requests an 80GB H100 can handle for: (a) text-only 500-token requests, (b) text + 1 image requests, given Llama 3.2 Vision 90B in FP16 (model weights ~180 GB, requires 3 GPUs for weights, KV cache budget per GPU ~50 GB).
Expected:
- (a) 500 tokens x 80 layers x 4,096 bytes/token/layer ≈ 164 MB per request. 50 GB / 164 MB ≈ 305 concurrent text requests.
- (b) 1,076 tokens x 80 layers x 4,096 bytes/token/layer ≈ 353 MB per request. 50 GB / 353 MB ≈ 141 concurrent multimodal requests.
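The expected answers can be checked mechanically; a small sketch using the challenge's numbers (500 text tokens, or 500 text + 576 image tokens):

```python
N_LAYERS = 80
BYTES_PER_TOKEN_LAYER = 4096   # GQA-8, head_dim 128, FP16: 2 * 8 * 128 * 2
KV_BUDGET_BYTES = 50e9         # per-GPU KV cache budget

def max_concurrent(tokens_per_request):
    """Floor of (KV budget) / (per-request KV cache)."""
    per_request = tokens_per_request * N_LAYERS * BYTES_PER_TOKEN_LAYER
    return int(KV_BUDGET_BYTES // per_request)

print(max_concurrent(500))   # (a) text-only: 305
print(max_concurrent(1076))  # (b) text + one 576-token image: 141
```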