A production LLM server at 100 requests/sec performs on the order of 100 KV cache allocations and frees every second. Each allocation is variable-size (100-4096 tokens) and long-lived (10-60 seconds per request). Naive per-request allocation creates fragmentation: after an hour, 30% of your GPU memory is scattered free blocks too small to fit new requests, and your effective capacity drops from 80 GB to 56 GB. P99 latency spikes by 3-5x because new requests queue waiting for a large-enough contiguous region to free up. A proper KV cache allocator — using fixed-size blocks, memory pooling, and prefix sharing — eliminates fragmentation and stabilizes P99 latency at the P50 level.

This post treats KV cache management as an allocator problem.

KV cache as an allocation workload

Each request grows by 1 token per decode step:

  • append K/V for every layer
  • size per token per layer: 2 · H · d · dtype_bytes (the factor 2 covers K and V; H heads of dimension d)

Total bytes per token:

def kv_bytes_per_token(layers, heads, head_dim, dtype_bytes=2):
    # factor 2 for K and V; dtype_bytes=2 assumes fp16/bf16
    return 2 * layers * heads * head_dim * dtype_bytes

Total bytes for a request of length L: Bytes(L) = L · BytesPerToken
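As a worked example, take a hypothetical 7B-class dense model (32 layers, 32 KV heads, head_dim 128, fp16 — illustrative numbers, not any specific model's config):

```python
def kv_bytes_per_token(layers, heads, head_dim, dtype_bytes=2):
    # repeated from above: factor 2 for K and V, summed over all layers
    return 2 * layers * heads * head_dim * dtype_bytes

# Hypothetical 7B-class config: 32 layers, 32 KV heads, head_dim 128, fp16
per_token = kv_bytes_per_token(layers=32, heads=32, head_dim=128)
print(per_token)                  # 524288 bytes = 512 KiB per token
print(2048 * per_token / 2**30)   # 1.0 -> a 2048-token request holds 1 GiB of KV
```

Half a megabyte per token is why a handful of long requests can dominate an 80 GB card, and why the allocator matters.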

Why naive contiguous allocation fragments

If you allocate one contiguous block sized for the maximum possible length, you waste memory on every short request; if you instead grow buffers by reallocating, each free leaves behind a hole, and the holes come in many different sizes.

Classic symptom: “free memory is large, but no block is large enough.”
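The symptom is easy to reproduce in a toy model. The sketch below uses hypothetical sizes, first-fit allocation, and no hole coalescing (which exaggerates the effect, but mirrors what interleaved request lifetimes do to a real heap): total free space stays large while the largest single hole shrinks.

```python
import random

# Toy arena: first-fit allocation of variable-size buffers, random frees,
# no coalescing. All sizes are hypothetical; the point is the gap between
# total free space and the largest single hole.
random.seed(0)
ARENA = 10_000
holes = [(0, ARENA)]    # free extents as (offset, size)
live = []               # live allocations as (offset, size)

def alloc(size):
    for i, (off, sz) in enumerate(holes):
        if sz >= size:
            holes[i] = (off + size, sz - size)   # split the hole
            live.append((off, size))
            return True
    return False                                 # "OOM" despite free space

for step in range(2000):
    alloc(random.randint(50, 400))
    if live and random.random() < 0.5:
        # free a random live buffer; its extent becomes a new hole
        holes.append(live.pop(random.randrange(len(live))))

total_free = sum(sz for _, sz in holes)
largest = max(sz for _, sz in holes)
print(f"free={total_free}, largest hole={largest}")
```

After a few thousand alloc/free cycles the arena has plenty of free space in aggregate, yet a single request larger than the biggest hole is rejected — the "free memory is large, but no block is large enough" failure.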


Allocator behavior under mixed request lengths

| Strategy | Fragmentation | p99 latency risk | Implementation cost |
| --- | --- | --- | --- |
| Contiguous per request | High | High | Low |
| Page/block allocator | Low | Low | Medium |
| Buddy allocator | Medium | Medium | High |

The fix: page/block allocation

Instead of “one big buffer per request,” allocate fixed-size pages (e.g., 16 tokens per page).

Then each request is a page list:

class PagePool:
    def __init__(self, total_pages, page_tokens=16):
        self.page_tokens = page_tokens
        self.free = list(range(total_pages))   # every free page is identical
        self.in_use = set()

    def alloc(self):
        # O(1): pop any free page; no size matching, no best-fit search
        if not self.free:
            return None
        p = self.free.pop()
        self.in_use.add(p)
        return p

    def free_page(self, p):
        if p in self.in_use:
            self.in_use.remove(p)
            self.free.append(p)

class RequestKV:
    def __init__(self):
        self.pages = []   # list[int] of physical page ids, in token order
        self.length = 0   # tokens written so far (maintained by the caller)

    def ensure_capacity(self, pool, new_len):
        # grab pages until capacity covers new_len; False signals backpressure
        while len(self.pages) * pool.page_tokens < new_len:
            p = pool.alloc()
            if p is None:
                return False
            self.pages.append(p)
        return True

Fragmentation collapses because all free blocks are identical.
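A quick usage sketch (classes repeated from above so the snippet runs standalone): when the pool runs dry, growth fails cleanly at a page boundary instead of fragmenting the heap.

```python
class PagePool:
    # repeated from above (free_page omitted; unused in this sketch)
    def __init__(self, total_pages, page_tokens=16):
        self.page_tokens = page_tokens
        self.free = list(range(total_pages))
        self.in_use = set()
    def alloc(self):
        if not self.free:
            return None
        p = self.free.pop()
        self.in_use.add(p)
        return p

class RequestKV:
    # repeated from above
    def __init__(self):
        self.pages, self.length = [], 0
    def ensure_capacity(self, pool, new_len):
        while len(self.pages) * pool.page_tokens < new_len:
            p = pool.alloc()
            if p is None:
                return False
            self.pages.append(p)
        return True

pool = PagePool(total_pages=4, page_tokens=16)
req = RequestKV()
print(req.ensure_capacity(pool, 40))    # True: 3 pages cover 40 tokens
print(req.pages, pool.free)             # [3, 2, 1] held, [0] still free
print(req.ensure_capacity(pool, 80))    # False: 80 tokens need 5 pages
```

The failed call is the scheduler's backpressure signal: the request waits for pages rather than triggering an OOM mid-decode.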

Page size is a real tuning knob

Small pages:

  • better packing for short requests
  • more page table overhead

Large pages:

  • lower overhead
  • worse internal fragmentation for short requests

Conceptual trade-off: page size

| Page size (tokens) | 8 | 16 | 32 | 64 |
| --- | --- | --- | --- | --- |
| Overhead (relative) | 1.0 | 0.6 | 0.4 | 0.3 |
| Internal frag (relative) | 0.4 | 0.6 | 0.8 | 1.0 |
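The internal-fragmentation half of this trade-off is just rounding arithmetic: each request wastes the unused tail of its last page. A quick sketch over a hypothetical mix of request lengths:

```python
def wasted_tokens(length, page_tokens):
    # tokens reserved but never written in the request's last page
    return (-length) % page_tokens

lengths = [120, 700, 1500, 4000]      # hypothetical request mix
for pt in (8, 16, 32, 64):
    pages = sum((L + pt - 1) // pt for L in lengths)
    waste = sum(wasted_tokens(L, pt) for L in lengths)
    print(f"page={pt:2d}  pages to track={pages:4d}  wasted tokens={waste}")
```

Metadata and lookup overhead scale with the page count, waste with the page size; the 16-token page used throughout this post sits in the middle of that range.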

Prefix sharing: beams and speculative decoding need it

Many requests share long prefixes (prompt + early tokens). Beam search shares prefixes until divergence.

If you copy KV pages per branch, you explode memory. Prefer copy-on-write semantics:

  • pages are reference-counted
  • only fork pages when a write occurs

class RefCountPool(PagePool):
    def __init__(self, total_pages, page_tokens=16):
        super().__init__(total_pages, page_tokens)
        self.refcnt = {}

    def alloc(self):
        # a freshly allocated page starts with exactly one owner
        p = super().alloc()
        if p is not None:
            self.refcnt[p] = 1
        return p

    def incref(self, p):
        # a new branch shares the page instead of copying it
        self.refcnt[p] += 1

    def decref(self, p):
        self.refcnt[p] -= 1
        if self.refcnt[p] == 0:
            del self.refcnt[p]
            self.free_page(p)
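A sketch of how beam forking uses these refcounts, with copy-on-write on first divergence. This is a standalone miniature (`Pool` collapses the two classes above into one), and `copy_page` is a hypothetical callback standing in for the real device-side KV copy:

```python
class Pool:
    # minimal refcounted page pool (same idea as RefCountPool above)
    def __init__(self, n):
        self.free, self.refcnt = list(range(n)), {}
    def alloc(self):
        if not self.free:
            return None
        p = self.free.pop()
        self.refcnt[p] = 1          # fresh page has one owner
        return p
    def incref(self, p):
        self.refcnt[p] += 1
    def decref(self, p):
        self.refcnt[p] -= 1
        if self.refcnt[p] == 0:
            del self.refcnt[p]
            self.free.append(p)

def fork_branch(pool, pages):
    # forking a beam is O(pages): bump refcounts, copy nothing
    for p in pages:
        pool.incref(p)
    return list(pages)

def cow_write(pool, pages, i, copy_page):
    # fork the physical page only when this branch first writes to it
    p = pages[i]
    if pool.refcnt[p] > 1:
        q = pool.alloc()
        if q is None:
            raise MemoryError("KV pool exhausted during COW fork")
        copy_page(p, q)             # hypothetical: copies KV data p -> q
        pool.decref(p)
        pages[i] = q
    return pages[i]
```

Forking a 2000-token beam touches zero KV bytes; only the tail page each branch actually writes into is ever copied.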

p99 stability: admission control based on pages

Instead of “admit request if bytes fit,” admit if pages are available for its expected growth:

def can_admit(pool_free_pages, est_tokens, page_tokens=16):
    # ceil-divide the estimated length into pages
    needed_pages = (est_tokens + page_tokens - 1) // page_tokens
    return pool_free_pages >= needed_pages

This makes p99 predictable because you never overcommit the KV cache.
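For example, with 16-token pages (the function repeated so the snippet runs on its own):

```python
def can_admit(pool_free_pages, est_tokens, page_tokens=16):
    # ceil-divide the estimated length into pages
    needed_pages = (est_tokens + page_tokens - 1) // page_tokens
    return pool_free_pages >= needed_pages

print(can_admit(10, 150))   # True: 150 tokens round up to 10 pages
print(can_admit(10, 161))   # False: 161 tokens round up to 11 pages
```

A one-token difference in the estimate can cross a page boundary, so admission decisions are made in whole pages — the same unit the allocator hands out.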

💡 Serving is allocator math

Stable p99 is not achieved by “faster kernels” alone. It’s achieved by never forcing batch collapse due to allocator failure.

Illustrative results (mixed workload)


KV allocator impact (example workload)

| Allocator | Utilization | OOM/evictions | p99 latency | Throughput |
| --- | --- | --- | --- | --- |
| Contiguous | 65% | Frequent | 480 ms | 1.0x |
| Paged pool | 94% | Rare | 180 ms | 1.8x |
| Paged + prefix share | 96% | Rare | 160 ms | 2.0x |

Conclusion

Treat KV cache as an allocator problem:

  • fixed-size pages kill fragmentation
  • pooling makes allocation constant-time
  • prefix sharing prevents beam/speculative blow-ups
  • admission control keeps p99 stable

Once KV cache is predictable, all the other optimizations (batching, CUDA graphs, flash attention) finally pay off consistently.