A production LLM server at 100 requests/sec performs on the order of 100 KV cache allocations and frees every second. Each allocation is variable-size (100-4096 tokens) and long-lived (10-60 seconds per request). Naive per-request allocation creates fragmentation: after an hour, 30% of your GPU memory is scattered free blocks too small to fit new requests, and your effective capacity drops from 80 GB to 56 GB. P99 latency spikes by 3-5x because new requests queue waiting for a large-enough contiguous region to free up. A proper KV cache allocator — using fixed-size blocks, memory pooling, and prefix sharing — eliminates fragmentation and stabilizes P99 latency at the P50 level.

This post treats KV cache management as an allocator problem.

KV cache as an allocation workload

Each request grows by 1 token per decode step:

  • append K/V for every layer
  • size per token per layer: 2 · H · d · dtype_bytes (the factor 2 covers K and V; H heads of dimension d)

Total bytes per token:

def kv_bytes_per_token(layers, heads, head_dim, dtype_bytes=2):
    # factor 2 for K and V; dtype_bytes=2 assumes fp16/bf16
    return 2 * layers * heads * head_dim * dtype_bytes

Total bytes for a request of length L: Bytes(L) = L · BytesPerToken
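As a worked example, take a hypothetical 7B-class dense model (32 layers, 32 KV heads, head_dim 128, fp16 — illustrative numbers, not any specific model's config):

```python
def kv_bytes_per_token(layers, heads, head_dim, dtype_bytes=2):
    # repeated from above: factor 2 for K and V, summed over all layers
    return 2 * layers * heads * head_dim * dtype_bytes

# Hypothetical 7B-class config: 32 layers, 32 KV heads, head_dim 128, fp16
per_token = kv_bytes_per_token(layers=32, heads=32, head_dim=128)
print(per_token)                  # 524288 bytes = 512 KiB per token
print(2048 * per_token / 2**30)   # 1.0 -> a 2048-token request holds 1 GiB of KV
```

Half a megabyte per token is why a handful of long requests can dominate an 80 GB card, and why the allocator matters.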

Why naive contiguous allocation fragments

If you allocate one contiguous block sized for the maximum possible length, you waste memory on every short request; if you instead grow buffers by reallocating, each free leaves behind a hole, and the holes come in many different sizes.

Classic symptom: “free memory is large, but no block is large enough.”
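The symptom is easy to reproduce in a toy model. The sketch below uses hypothetical sizes, first-fit allocation, and no hole coalescing (which exaggerates the effect, but mirrors what interleaved request lifetimes do to a real heap): total free space stays large while the largest single hole shrinks.

```python
import random

# Toy arena: first-fit allocation of variable-size buffers, random frees,
# no coalescing. All sizes are hypothetical; the point is the gap between
# total free space and the largest single hole.
random.seed(0)
ARENA = 10_000
holes = [(0, ARENA)]    # free extents as (offset, size)
live = []               # live allocations as (offset, size)

def alloc(size):
    for i, (off, sz) in enumerate(holes):
        if sz >= size:
            holes[i] = (off + size, sz - size)   # split the hole
            live.append((off, size))
            return True
    return False                                 # "OOM" despite free space

for step in range(2000):
    alloc(random.randint(50, 400))
    if live and random.random() < 0.5:
        # free a random live buffer; its extent becomes a new hole
        holes.append(live.pop(random.randrange(len(live))))

total_free = sum(sz for _, sz in holes)
largest = max(sz for _, sz in holes)
print(f"free={total_free}, largest hole={largest}")
```

After a few thousand alloc/free cycles the arena has plenty of free space in aggregate, yet a single request larger than the biggest hole is rejected — the "free memory is large, but no block is large enough" failure.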


Allocator behavior under mixed request lengths

| Strategy | Fragmentation | p99 latency risk | Implementation cost |
| --- | --- | --- | --- |
| Contiguous per request | High | High | Low |
| Page/block allocator | Low | Low | Medium |
| Buddy allocator | Medium | Medium | High |

The fix: page/block allocation

Instead of “one big buffer per request,” allocate fixed-size pages (e.g., 16 tokens per page).

Then each request is a page list:

class PagePool:
    def __init__(self, total_pages, page_tokens=16):
        self.page_tokens = page_tokens
        self.free = list(range(total_pages))   # every free page is identical
        self.in_use = set()

    def alloc(self):
        # O(1): pop any free page; no size matching, no best-fit search
        if not self.free:
            return None
        p = self.free.pop()
        self.in_use.add(p)
        return p

    def free_page(self, p):
        if p in self.in_use:
            self.in_use.remove(p)
            self.free.append(p)

class RequestKV:
    def __init__(self):
        self.pages = []   # list[int] of physical page ids, in token order
        self.length = 0   # tokens written so far (maintained by the caller)

    def ensure_capacity(self, pool, new_len):
        # grab pages until capacity covers new_len; False signals backpressure
        while len(self.pages) * pool.page_tokens < new_len:
            p = pool.alloc()
            if p is None:
                return False
            self.pages.append(p)
        return True

Fragmentation collapses because all free blocks are identical.
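A quick usage sketch (classes repeated from above so the snippet runs standalone): when the pool runs dry, growth fails cleanly at a page boundary instead of fragmenting the heap.

```python
class PagePool:
    # repeated from above (free_page omitted; unused in this sketch)
    def __init__(self, total_pages, page_tokens=16):
        self.page_tokens = page_tokens
        self.free = list(range(total_pages))
        self.in_use = set()
    def alloc(self):
        if not self.free:
            return None
        p = self.free.pop()
        self.in_use.add(p)
        return p

class RequestKV:
    # repeated from above
    def __init__(self):
        self.pages, self.length = [], 0
    def ensure_capacity(self, pool, new_len):
        while len(self.pages) * pool.page_tokens < new_len:
            p = pool.alloc()
            if p is None:
                return False
            self.pages.append(p)
        return True

pool = PagePool(total_pages=4, page_tokens=16)
req = RequestKV()
print(req.ensure_capacity(pool, 40))    # True: 3 pages cover 40 tokens
print(req.pages, pool.free)             # [3, 2, 1] held, [0] still free
print(req.ensure_capacity(pool, 80))    # False: 80 tokens need 5 pages
```

The failed call is the scheduler's backpressure signal: the request waits for pages rather than triggering an OOM mid-decode.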

Page size is a real tuning knob

Small pages:

  • better packing for short requests
  • more page table overhead

Large pages:

  • lower overhead
  • worse internal fragmentation for short requests

Conceptual trade-off: page size

| Page size (tokens) | 8 | 16 | 32 | 64 |
| --- | --- | --- | --- | --- |
| Overhead (relative) | 1.0 | 0.6 | 0.4 | 0.3 |
| Internal frag (relative) | 0.4 | 0.6 | 0.8 | 1.0 |
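The internal-fragmentation half of this trade-off is just rounding arithmetic: each request wastes the unused tail of its last page. A quick sketch over a hypothetical mix of request lengths:

```python
def wasted_tokens(length, page_tokens):
    # tokens reserved but never written in the request's last page
    return (-length) % page_tokens

lengths = [120, 700, 1500, 4000]      # hypothetical request mix
for pt in (8, 16, 32, 64):
    pages = sum((L + pt - 1) // pt for L in lengths)
    waste = sum(wasted_tokens(L, pt) for L in lengths)
    print(f"page={pt:2d}  pages to track={pages:4d}  wasted tokens={waste}")
```

Metadata and lookup overhead scale with the page count, waste with the page size; the 16-token page used throughout this post sits in the middle of that range.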

Prefix sharing: beams and speculative decoding need it

Many requests share long prefixes (prompt + early tokens). Beam search shares prefixes until divergence.

If you copy KV pages per branch, you explode memory. Prefer copy-on-write semantics:

  • pages are reference-counted
  • only fork pages when a write occurs

class RefCountPool(PagePool):
    def __init__(self, total_pages, page_tokens=16):
        super().__init__(total_pages, page_tokens)
        self.refcnt = {}

    def alloc(self):
        # a freshly allocated page starts with exactly one owner
        p = super().alloc()
        if p is not None:
            self.refcnt[p] = 1
        return p

    def incref(self, p):
        # a new branch shares the page instead of copying it
        self.refcnt[p] += 1

    def decref(self, p):
        self.refcnt[p] -= 1
        if self.refcnt[p] == 0:
            del self.refcnt[p]
            self.free_page(p)
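A sketch of how beam forking uses these refcounts, with copy-on-write on first divergence. This is a standalone miniature (`Pool` collapses the two classes above into one), and `copy_page` is a hypothetical callback standing in for the real device-side KV copy:

```python
class Pool:
    # minimal refcounted page pool (same idea as RefCountPool above)
    def __init__(self, n):
        self.free, self.refcnt = list(range(n)), {}
    def alloc(self):
        if not self.free:
            return None
        p = self.free.pop()
        self.refcnt[p] = 1          # fresh page has one owner
        return p
    def incref(self, p):
        self.refcnt[p] += 1
    def decref(self, p):
        self.refcnt[p] -= 1
        if self.refcnt[p] == 0:
            del self.refcnt[p]
            self.free.append(p)

def fork_branch(pool, pages):
    # forking a beam is O(pages): bump refcounts, copy nothing
    for p in pages:
        pool.incref(p)
    return list(pages)

def cow_write(pool, pages, i, copy_page):
    # fork the physical page only when this branch first writes to it
    p = pages[i]
    if pool.refcnt[p] > 1:
        q = pool.alloc()
        if q is None:
            raise MemoryError("KV pool exhausted during COW fork")
        copy_page(p, q)             # hypothetical: copies KV data p -> q
        pool.decref(p)
        pages[i] = q
    return pages[i]
```

Forking a 2000-token beam touches zero KV bytes; only the tail page each branch actually writes into is ever copied.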

p99 stability: admission control based on pages

Instead of “admit request if bytes fit,” admit if pages are available for its expected growth:

def can_admit(pool_free_pages, est_tokens, page_tokens=16):
    # ceil-divide the estimated length into pages
    needed_pages = (est_tokens + page_tokens - 1) // page_tokens
    return pool_free_pages >= needed_pages

This makes p99 predictable because you never overcommit the KV cache.
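For example, with 16-token pages (the function repeated so the snippet runs on its own):

```python
def can_admit(pool_free_pages, est_tokens, page_tokens=16):
    # ceil-divide the estimated length into pages
    needed_pages = (est_tokens + page_tokens - 1) // page_tokens
    return pool_free_pages >= needed_pages

print(can_admit(10, 150))   # True: 150 tokens round up to 10 pages
print(can_admit(10, 161))   # False: 161 tokens round up to 11 pages
```

A one-token difference in the estimate can cross a page boundary, so admission decisions are made in whole pages — the same unit the allocator hands out.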

💡 Serving is allocator math

Stable p99 is not achieved by “faster kernels” alone. It’s achieved by never forcing batch collapse due to allocator failure.

Illustrative results (mixed workload)


KV allocator impact (example workload)

| Allocator | Utilization | OOM/evictions | p99 latency | Throughput |
| --- | --- | --- | --- | --- |
| Contiguous | 65% | Frequent | 480 ms | 1.0x |
| Paged pool | 94% | Rare | 180 ms | 1.8x |
| Paged + prefix share | 96% | Rare | 160 ms | 2.0x |

Conclusion

Treat KV cache as an allocator problem:

  • fixed-size pages kill fragmentation
  • pooling makes allocation constant-time
  • prefix sharing prevents beam/speculative blow-ups
  • admission control keeps p99 stable

Once KV cache is predictable, all the other optimizations (batching, CUDA graphs, flash attention) finally pay off consistently.