A production LLM server at 100 requests/sec allocates and frees KV cache 100 times per second. Each allocation is variable-size (100-4096 tokens) and long-lived (10-60 seconds per request). Naive per-request allocation creates fragmentation: after an hour, 30% of your GPU memory is scattered free blocks too small to fit new requests, and your effective capacity drops from 80 GB to 56 GB. P99 latency spikes by 3-5x because new requests queue waiting for fragmentation to clear. A proper KV cache allocator — using fixed-size blocks, memory pooling, and prefix sharing — eliminates fragmentation and stabilizes P99 latency at the P50 level.
This post treats KV cache management as an allocator problem.
## KV cache as an allocation workload
Each request grows by one token per decode step:

- append K and V for every layer
- shape per token per layer: K and V each of shape `(heads, head_dim)`

Total bytes per token:

```python
def kv_bytes_per_token(layers, heads, head_dim, dtype_bytes=2):
    # 2 = one K tensor and one V tensor per layer
    return 2 * layers * heads * head_dim * dtype_bytes
```

Total bytes for a request of length `n` is then `n * kv_bytes_per_token(...)`.
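As a worked example, take a hypothetical 7B-class shape (32 layers, 32 KV heads, head dim 128, fp16) — these numbers are illustrative assumptions, not a claim about any specific model. The function is repeated so the sketch runs standalone:

```python
def kv_bytes_per_token(layers, heads, head_dim, dtype_bytes=2):
    # 2 = one K tensor and one V tensor per layer
    return 2 * layers * heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(32, 32, 128)
print(per_token)                    # 524288 bytes = 512 KiB per token
print(per_token * 4096 // 2**30)   # a 4096-token request: 2 GiB of KV cache
```

At half a megabyte per token, a single long request ties up gigabytes, which is why allocation strategy dominates capacity.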
## Why naive contiguous allocation fragments
If you allocate a contiguous block sized for “max length” or allocate and reallocate growing buffers, you create holes of many sizes.
Classic symptom: “free memory is large, but no block is large enough.”
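The symptom can be shown with a toy first-fit allocator over a flat slot space (sizes here are arbitrary "slots," not real GPU bytes): after freeing the middle of three allocations, 4 slots are free in total, yet no single hole can serve a 4-slot request.

```python
class FirstFit:
    """Toy contiguous allocator over a flat slot space (illustration only)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.allocs = {}          # request id -> (offset, size)
        self.next_id = 0

    def free_ranges(self):
        # Walk allocations in address order and collect the gaps between them.
        ranges, cursor = [], 0
        for off, size in sorted(self.allocs.values()):
            if off > cursor:
                ranges.append((cursor, off - cursor))
            cursor = off + size
        if cursor < self.capacity:
            ranges.append((cursor, self.capacity - cursor))
        return ranges

    def alloc(self, size):
        for off, hole in self.free_ranges():
            if hole >= size:
                rid = self.next_id
                self.next_id += 1
                self.allocs[rid] = (off, size)
                return rid
        return None               # no single hole is big enough

    def free(self, rid):
        del self.allocs[rid]

heap = FirstFit(10)
a, b, c = heap.alloc(3), heap.alloc(3), heap.alloc(3)   # offsets 0, 3, 6
heap.free(b)                      # leaves holes at [3, 6) and [9, 10)
total_free = sum(sz for _, sz in heap.free_ranges())
print(total_free)                 # 4 slots free in total...
print(heap.alloc(4))              # ...but no contiguous run of 4: None
```

With mixed request lengths arriving and departing at different times, a real server accumulates exactly these mismatched holes.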
## Allocator behavior under mixed request lengths
| Strategy | Fragmentation | p99 latency risk | Implementation cost |
|---|---|---|---|
| Contiguous per request | High | High | Low |
| Page/block allocator | Low | Low | Medium |
| Buddy allocator | Medium | Medium | High |
## The fix: page/block allocation
Instead of “one big buffer per request,” allocate fixed-size pages (e.g., 16 tokens per page).
Then each request is a page list:
```python
class PagePool:
    def __init__(self, total_pages, page_tokens=16):
        self.page_tokens = page_tokens
        self.free = list(range(total_pages))  # all pages identical: a free list suffices
        self.in_use = set()

    def alloc(self):
        if not self.free:
            return None  # pool exhausted; caller decides to queue or evict
        p = self.free.pop()
        self.in_use.add(p)
        return p

    def free_page(self, p):
        if p in self.in_use:
            self.in_use.remove(p)
            self.free.append(p)


class RequestKV:
    def __init__(self):
        self.pages = []   # list[int]: physical page ids in logical order
        self.length = 0   # tokens written so far

    def ensure_capacity(self, pool, new_len):
        # Grab pages one at a time until new_len tokens fit.
        while len(self.pages) * pool.page_tokens < new_len:
            p = pool.alloc()
            if p is None:
                return False  # out of pages; pages already held are kept
            self.pages.append(p)
        return True
```
Fragmentation collapses because all free blocks are identical.
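Reads then go through the page list like a page table: a logical token index translates to a (physical page, slot) pair. A minimal sketch of that lookup (`locate` is a hypothetical helper, not part of the classes above):

```python
def locate(pages, token_idx, page_tokens=16):
    """Translate a logical token index into (physical page id, slot in page)."""
    return pages[token_idx // page_tokens], token_idx % page_tokens

# With page list [7, 2, 9] and 16-token pages, token 20 lives in slot 4 of page 2.
print(locate([7, 2, 9], 20))  # (2, 4)
```

The attention kernel does this translation per page, which is exactly the indirection paged-attention implementations build into their gather step.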
## Page size is a real tuning knob
Small pages:
- better packing for short requests
- more page table overhead
Large pages:
- lower overhead
- worse internal fragmentation for short requests
Conceptual trade-off by page size (tokens per page):

| Metric | 8 | 16 | 32 | 64 |
|---|---|---|---|---|
| Page-table overhead (relative) | highest | high | low | lowest |
| Internal fragmentation (relative) | lowest | low | high | highest |
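The internal-fragmentation row can be made concrete by counting wasted slots in the last, partially filled page (request lengths here are arbitrary examples):

```python
def wasted_tokens(length, page_tokens):
    """Slots allocated but never written: internal fragmentation per request."""
    pages = -(-length // page_tokens)   # ceil division
    return pages * page_tokens - length

for page_tokens in (8, 16, 32, 64):
    waste = [wasted_tokens(n, page_tokens) for n in (100, 500, 4096)]
    print(page_tokens, waste)
# 8  -> [4, 4, 0]
# 16 -> [12, 12, 0]
# 32 -> [28, 12, 0]
# 64 -> [28, 12, 0]
```

Waste per request is bounded by `page_tokens - 1`, so the absolute cost of large pages is small per request but multiplies across hundreds of concurrent short requests.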
## Prefix sharing: beams and speculative decoding need it
Many requests share long prefixes (prompt + early tokens). Beam search shares prefixes until divergence.
If you copy KV pages per branch, you explode memory. Prefer copy-on-write semantics:
- pages are reference-counted
- only fork pages when a write occurs
```python
class RefCountPool(PagePool):
    def __init__(self, total_pages, page_tokens=16):
        super().__init__(total_pages, page_tokens)
        self.refcnt = {}  # page id -> number of requests referencing it

    def incref(self, p):
        self.refcnt[p] = self.refcnt.get(p, 0) + 1

    def decref(self, p):
        self.refcnt[p] -= 1
        if self.refcnt[p] == 0:
            del self.refcnt[p]
            self.free_page(p)  # last reference gone: page returns to the pool
```
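The fork-then-write flow can be sketched standalone (hypothetical page ids; `share` and `write` are illustrative helpers, independent of the pool classes above):

```python
# Copy-on-write fork over a refcounted page list (illustrative sketch).
refcnt = {}

def share(pages):
    """Fork a branch: bump refcounts; copy only the page *list*, not the pages."""
    for p in pages:
        refcnt[p] = refcnt.get(p, 0) + 1
    return list(pages)

def write(pages, i, fresh_page):
    """Before writing page i, copy it if another branch still references it."""
    if refcnt.get(pages[i], 0) > 1:
        refcnt[pages[i]] -= 1       # drop our reference to the shared page
        pages[i] = fresh_page       # fresh_page: newly allocated page id (assumed free)
        refcnt[fresh_page] = 1
    # else: sole owner, safe to write in place
    return pages

parent = share([0, 1, 2])           # prompt KV lives in pages 0..2
child = share(parent)               # beam fork: zero pages copied
write(child, 2, fresh_page=3)       # child diverges: only page 2 is duplicated
print(parent, child)                # [0, 1, 2] [0, 1, 3]
```

Only the written page is duplicated; the shared prefix pages stay refcounted at 2, so a k-way beam costs roughly one extra page per branch rather than a full KV copy.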
## p99 stability: admission control based on pages
Instead of “admit request if bytes fit,” admit if pages are available for its expected growth:
```python
def can_admit(pool_free_pages, est_tokens, page_tokens=16):
    needed_pages = (est_tokens + page_tokens - 1) // page_tokens
    return pool_free_pages >= needed_pages
```
This makes p99 predictable because you never overcommit the KV cache.
Stable p99 is not achieved by “faster kernels” alone. It’s achieved by never forcing batch collapse due to allocator failure.
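A quick sanity check of the page rounding above (function repeated so it runs standalone; token counts are arbitrary examples):

```python
def can_admit(pool_free_pages, est_tokens, page_tokens=16):
    # (a + b - 1) // b is ceiling division: a partial page costs a whole page.
    needed_pages = (est_tokens + page_tokens - 1) // page_tokens
    return pool_free_pages >= needed_pages

print(can_admit(10, 150))  # 150 tokens -> 10 pages: True
print(can_admit(10, 161))  # 161 tokens -> 11 pages: False
```

The off-by-one matters: admitting on raw byte counts instead of rounded pages is exactly the overcommit that forces mid-decode evictions.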
## Illustrative results (mixed workload)
KV allocator impact (example workload)
| Allocator | Utilization | OOM/evictions | p99 latency | Throughput |
|---|---|---|---|---|
| Contiguous | 65% | Frequent | 480 ms | 1.0x |
| Paged pool | 94% | Rare | 180 ms | 1.8x |
| Paged + prefix share | 96% | Rare | 160 ms | 2.0x |
## Conclusion
Treat KV cache as an allocator problem:
- fixed-size pages kill fragmentation
- pooling makes allocation constant-time
- prefix sharing prevents beam/speculative blow-ups
- admission control keeps p99 stable
Once KV cache is predictable, all the other optimizations (batching, CUDA graphs, flash attention) finally pay off consistently.