SGLang achieves 93% prefix cache hit rates on single-node deployments by indexing all cached KV blocks in a radix tree—when two requests share a 5000-token system prompt, the second request reuses those cached blocks instead of recomputing them. But SGLang has no cross-node awareness: if you run 10 SGLang workers, each maintains its own isolated cache. Dynamo solves this with cluster-level KV routing: when a new request arrives, the router picks the worker with the highest cache overlap. Combining both gets you 93% single-node hit rate AND optimal cross-node placement. This post shows how to integrate SGLang workers under Dynamo orchestration.
## Architectural Overview

### SGLang: Single-Node Optimizer
SGLang’s core innovation is RadixAttention — a radix tree data structure that indexes all cached KV data by token prefix:
```python
# SGLang RadixTree conceptual structure

BLOCK_SIZE = 16  # tokens per KV cache block

class RadixTreeNode:
    """Node in the radix tree. Each node represents a cached prefix."""
    def __init__(self):
        self.children = {}             # token_id -> child RadixTreeNode
        self.kv_cache_block_id = None  # Physical block containing KV data
        self.ref_count = 0             # Number of sequences using this prefix
        self.last_access_time = 0      # For eviction ordering

class RadixTree:
    """Radix tree for KV cache prefix sharing."""
    def __init__(self):
        self.root = RadixTreeNode()

    def insert(self, token_ids, block_ids):
        """Insert a sequence of tokens with their KV cache blocks."""
        node = self.root
        for i, token in enumerate(token_ids):
            if token not in node.children:
                node.children[token] = RadixTreeNode()
            node = node.children[token]
            node.kv_cache_block_id = block_ids[i // BLOCK_SIZE]
            node.ref_count += 1

    def match_prefix(self, token_ids):
        """Find the longest cached prefix for a token sequence."""
        node = self.root
        matched = 0
        for token in token_ids:
            if token in node.children:
                node = node.children[token]
                matched += 1
            else:
                break
        return matched, node
```
When a new request arrives, SGLang walks the radix tree to find the longest cached prefix. If a previous request shared the same system prompt + few-shot examples, the KV cache for those tokens is already computed and can be reused:
```python
def schedule_request(request, radix_tree):
    """Schedule a request using RadixAttention."""
    # Find cached prefix
    matched_len, node = radix_tree.match_prefix(request.prompt_tokens)
    # Only prefill the uncached suffix
    uncached_tokens = request.prompt_tokens[matched_len:]
    return {
        'cached_prefix_len': matched_len,
        'uncached_tokens': uncached_tokens,
        'prefill_savings': matched_len / len(request.prompt_tokens),
    }
```
### Dynamo: Multi-Node Orchestrator

Dynamo operates above the inference engine layer. It does not manage individual KV cache blocks — it tracks which nodes have which prefixes cached and routes accordingly:
```python
# Dynamo Router conceptual structure
class DynamoRouter:
    """Cluster-level KV-aware router."""
    def __init__(self, workers):
        self.workers = workers
        # Maps: worker_id -> set of prefix hashes that worker has cached
        self.cache_index = {}

    def route(self, request):
        """Route request to the worker with best KV cache overlap."""
        # Assumed helper: one cumulative hash per block of the prompt,
        # e.g. [hash(tokens[:16]), hash(tokens[:32]), ...]
        prompt_hashes = hash_prefix_chain(request.prompt_tokens)
        best_worker = None
        best_score = float("-inf")
        for worker in self.workers:
            # Score = KV cache overlap - queue penalty
            overlap = self._compute_overlap(prompt_hashes, worker)
            queue_penalty = worker.queue_depth * worker.avg_step_time
            score = overlap - queue_penalty
            if score > best_score:
                best_score = score
                best_worker = worker
        return best_worker

    def _compute_overlap(self, prompt_hashes, worker):
        """Estimate KV cache overlap between request and worker."""
        # Worker reports its cached prefix hashes periodically
        worker_hashes = self.cache_index.get(worker.id, set())
        # Walk block hashes from longest prefix to shortest; the first
        # match gives the fraction of the prompt that is cached
        for num_blocks in range(len(prompt_hashes), 0, -1):
            if prompt_hashes[num_blocks - 1] in worker_hashes:
                return num_blocks / len(prompt_hashes)
        return 0.0
```
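The prefix-hashing helper the router relies on is left undefined above. One plausible sketch (an assumption for illustration, not Dynamo's actual implementation) produces a chain of cumulative hashes, one per 16-token block, so a longer shared prefix yields a longer run of matching hashes:

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (assumed)

def hash_prefix_chain(token_ids, block_size=BLOCK_SIZE):
    """One cumulative hash per full block: entry i identifies the
    prefix token_ids[: (i + 1) * block_size]."""
    hashes = []
    h = hashlib.sha256()
    for i in range(0, len(token_ids) - block_size + 1, block_size):
        for t in token_ids[i:i + block_size]:
            h.update(t.to_bytes(4, "little"))
        hashes.append(h.copy().hexdigest())
    return hashes

a = hash_prefix_chain(list(range(40)))              # 2 full blocks
b = hash_prefix_chain(list(range(16)) + [99] * 24)  # diverges in block 2
assert a[0] == b[0]   # same first 16 tokens -> same first hash
assert a[1] != b[1]   # prefixes diverge in the second block
```

Because each hash folds in all earlier blocks, two prompts share a hash at position i only if their first `(i + 1) * block_size` tokens are identical.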
### Key Differences

**Architectural Comparison: SGLang vs Dynamo**
| Aspect | SGLang | Dynamo |
|---|---|---|
| Scope | Single node | Multi-node cluster |
| Cache granularity | Per-block (16 tokens) | Per-prefix-hash (coarse) |
| Cache sharing | Radix tree (exact match) | Hash-based (prefix match) |
| Routing scope | Within-GPU scheduling | Across-node routing |
| Prefill/decode split | No (unified) | Yes (disaggregated) |
| Language | Python + CUDA | Rust + Python |
| KV transfer | In-GPU-memory (zero copy) | GPU-to-GPU (NIXL/RDMA) |
| Cold start | Model load from disk | NIXL streaming from peer |
| Autoscaling | No | Planner + Grove autoscaler |
## RadixAttention Deep Dive

### How the Radix Tree Enables Sharing
Consider three requests arriving at an SGLang server:
```
Request 1: [System prompt (200 tok)] + [User: "Summarize this code" (50 tok)]
Request 2: [System prompt (200 tok)] + [User: "Explain this error" (40 tok)]
Request 3: [System prompt (200 tok)] + [Few-shot examples (500 tok)] + [User: "Classify" (20 tok)]
```

The radix tree after all three requests (each divergent branch occupies its own physical blocks):

```
Root
|-- [System prompt, 200 tok] --> KV blocks 0-12
    |-- [User: "Summarize...", 50 tok] --> KV blocks 13-16 (Request 1)
    |-- [User: "Explain...", 40 tok] --> KV blocks 17-19 (Request 2)
    |-- [Few-shot examples, 500 tok] --> KV blocks 20-51 (Request 3)
        |-- [User: "Classify", 20 tok] --> KV blocks 52-53 (Request 3)
```
Requests 1 and 2 share the system prompt KV cache (blocks 0-12). Request 3 shares the system prompt and has its own few-shot branch. Without RadixAttention, each request would independently compute the system prompt — 3x the prefill work.
The savings compound with scale:
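This compounding is easy to see numerically: with one shared system prompt, only the first request pays the shared prefill cost, so the hit rate climbs toward the shared fraction of the prompt as traffic grows (a sketch with assumed token counts):

```python
def hit_rate(num_requests, shared_tokens=4096, unique_tokens=200):
    """Fraction of prompt tokens served from cache across a batch of
    requests that all share one system prompt (first request misses)."""
    total = num_requests * (shared_tokens + unique_tokens)
    cached = (num_requests - 1) * shared_tokens
    return cached / total

for n in (1, 10, 100, 1000):
    print(n, round(hit_rate(n), 3))
# climbs toward shared / (shared + unique) ~= 0.953 as n grows
```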
[Figure: Prefix Cache Hit Rate vs. System Prompt Sharing at 100 RPS, in % hit rate]

### Cache Eviction
When GPU memory is full, SGLang evicts the least-recently-used leaf nodes from the radix tree:
```python
class RadixTreeEvictor:
    """LRU eviction from the radix tree."""
    def evict(self, tree, num_blocks_to_free):
        """Evict leaf nodes to free KV cache blocks."""
        freed = 0
        # Collect all leaf nodes, ordered by last access time
        leaves = self._collect_leaves(tree.root)
        leaves.sort(key=lambda n: n.last_access_time)  # Oldest first
        for leaf in leaves:
            if freed >= num_blocks_to_free:
                break
            if leaf.ref_count == 0:  # Only evict unreferenced nodes
                freed += self._remove_leaf(tree, leaf)
        return freed

    def _collect_leaves(self, node):
        """Collect all leaf nodes."""
        if not node.children:
            return [node]
        leaves = []
        for child in node.children.values():
            leaves.extend(self._collect_leaves(child))
        return leaves

    def _remove_leaf(self, tree, leaf):
        """Remove a leaf node and free its KV block. A full
        implementation would also walk up and prune parents left
        with no children (empty branches)."""
        leaf.kv_cache_block_id = None
        return 1  # Freed one block
```
## Dynamo KV-Aware Routing Deep Dive

### The Cache Index
Dynamo maintains a cluster-wide index of which workers have which prefix hashes cached:
```python
import threading

class ClusterCacheIndex:
    """Cluster-wide index of KV cache prefix locations."""
    def __init__(self):
        # prefix_hash -> {worker_id: cached_token_count}
        self.index = {}
        self.lock = threading.Lock()

    def update_worker_cache(self, worker_id, cached_prefixes):
        """Worker reports its cached prefixes."""
        with self.lock:
            # Remove old entries for this worker
            for prefix_hash in list(self.index.keys()):
                if worker_id in self.index[prefix_hash]:
                    del self.index[prefix_hash][worker_id]
                    if not self.index[prefix_hash]:
                        del self.index[prefix_hash]
            # Add new entries
            for prefix_hash, token_count in cached_prefixes.items():
                if prefix_hash not in self.index:
                    self.index[prefix_hash] = {}
                self.index[prefix_hash][worker_id] = token_count

    def find_best_worker(self, prompt_hashes, workers):
        """Find the worker with best cache overlap for this prompt."""
        scores = {w.id: 0 for w in workers}
        with self.lock:
            for prefix_hash in prompt_hashes:
                if prefix_hash in self.index:
                    for worker_id, cached_count in self.index[prefix_hash].items():
                        if worker_id in scores:
                            scores[worker_id] += cached_count
        best_id = max(scores, key=scores.get)
        return next(w for w in workers if w.id == best_id), scores[best_id]
```
### Routing Staleness

The cache index is updated periodically (every 1-5 seconds), not in real time, so routing decisions are based on slightly stale information:
```python
import time

class StalenessAwareRouter:
    """Route with awareness of cache index staleness."""
    def __init__(self, cache_index, update_interval=2.0):
        self.cache_index = cache_index
        self.update_interval = update_interval
        self.last_update = {}  # worker_id -> timestamp

    def route(self, request, workers):
        # Compute cache overlap scores
        scores = {}
        for worker in workers:
            # find_overlap is an assumed helper returning the fraction
            # of the prompt this worker has cached (0.0-1.0)
            overlap = self.cache_index.find_overlap(request, worker)
            staleness = time.time() - self.last_update.get(worker.id, 0)
            # Discount overlap score by staleness: after ~3 update
            # intervals, cache state could have changed significantly
            freshness = max(0, 1.0 - staleness / (self.update_interval * 3))
            scores[worker] = overlap * freshness + (1 - freshness) * 0.5
        # Also factor in queue depth (always fresh, reported per-request)
        for worker in workers:
            queue_cost = worker.queue_depth * 0.01  # Penalty per queued request
            scores[worker] -= queue_cost
        return max(scores, key=scores.get)
```
A 2-second staleness window means the router might route a request to a worker that evicted the relevant cache 1 second ago. The cost of such a miss is one redundant prefill (50-500 ms). Compared to the alternative — real-time cache synchronization across the cluster, which would add network overhead to every cache operation — periodic updates with staleness tolerance are the right tradeoff.
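Rough numbers make the tradeoff concrete (illustrative values, not measurements): if a fraction of routing decisions go stale per sync window, the expected cost is that fraction times one redundant prefill, whereas synchronous updates would tax every request:

```python
def expected_staleness_cost(miss_prob, prefill_ms=250.0):
    """Expected extra latency per request from stale routing, assuming
    each stale decision costs one redundant prefill."""
    return miss_prob * prefill_ms

def realtime_sync_cost(cache_ops_per_request, rtt_ms=0.2):
    """Cost of making every cache-block update a synchronous
    cluster-wide operation instead (hypothetical per-op RTT)."""
    return cache_ops_per_request * rtt_ms

# ~5% stale decisions at ~250 ms per redundant prefill: ~12.5 ms/request
print(expected_staleness_cost(0.05))
# ~100 cache-block ops per request at 0.2 ms each: ~20 ms on EVERY request
print(realtime_sync_cost(100))
```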
## Where Each System Excels

### SGLang Wins: High Prefix Sharing, Single Node
When most requests share common prefixes (same system prompt, similar few-shot patterns) and the workload fits on a single node:
```python
def simulate_sglang_advantage():
    """Scenario where SGLang's fine-grained sharing excels."""
    # 100 requests, all sharing a 4K system prompt
    system_prompt_tokens = 4096
    user_query_tokens = 200  # Average
    # SGLang: prefill 4096 tokens once, then 200 per request
    sglang_prefill_total = system_prompt_tokens + 100 * user_query_tokens
    # = 4096 + 20000 = 24,096 tokens
    # Without prefix caching: prefill everything per request
    naive_prefill_total = 100 * (system_prompt_tokens + user_query_tokens)
    # = 100 * 4296 = 429,600 tokens
    savings = 1 - sglang_prefill_total / naive_prefill_total
    # = 94.4% prefill savings
    return {
        'sglang_prefill_tokens': sglang_prefill_total,
        'naive_prefill_tokens': naive_prefill_total,
        'savings_percent': savings * 100,
    }
```
### Dynamo Wins: Multi-Node, Heterogeneous Traffic
When the workload spans multiple nodes with diverse traffic patterns:
```python
def simulate_dynamo_advantage():
    """Scenario where Dynamo's cluster routing excels."""
    # 4 nodes, each with 80GB KV cache;
    # 20 different system prompts, each requiring 1GB of KV cache
    #
    # Without Dynamo: each node stores all 20 prompts
    #   = 20 * 1GB = 20GB per node (25% of KV cache for system prompts)
    # With Dynamo: route by system prompt, each node handles 5 prompts
    #   = 5 * 1GB = 5GB per node (6.25% of KV cache)
    # More KV cache available for user-specific data
    nodes = 4
    prompts = 20
    prompt_kv_size_gb = 1
    without_dynamo_kv_per_node = prompts * prompt_kv_size_gb
    with_dynamo_kv_per_node = (prompts / nodes) * prompt_kv_size_gb
    kv_savings_gb = without_dynamo_kv_per_node - with_dynamo_kv_per_node
    return {
        'without_dynamo_kv_overhead_gb': without_dynamo_kv_per_node,
        'with_dynamo_kv_overhead_gb': with_dynamo_kv_per_node,
        'kv_savings_gb_per_node': kv_savings_gb,
        'additional_kv_blocks_per_node': int(kv_savings_gb * 1024 / 5.24),
        # ~2931 more KV blocks per node at 5.24 MB/block for Llama 70B
    }
```
### Performance Comparison by Scenario
| Scenario | SGLang Only | Dynamo Only | SGLang + Dynamo |
|---|---|---|---|
| Single node, shared prefix | 1.0x (baseline) | 0.9x (routing overhead) | 1.0x |
| 4 nodes, shared prefix | 1.0x per node | 1.8x (cache routing) | 2.0x |
| 4 nodes, diverse prompts | 1.0x per node | 2.5x (locality routing) | 2.8x |
| 16 nodes, bursty traffic | 1.0x per node | 3.5x (load balancing) | 4.0x |
| Prefill/decode disagg. | N/A (not supported) | 2.0x (pipeline) | 2.2x |
## Integration Architecture

### Dynamo + SGLang: The Best of Both
The optimal architecture uses Dynamo as the cluster orchestrator and SGLang as the per-node inference engine:
```python
import threading
import time

class IntegratedSystem:
    """Dynamo cluster routing with SGLang per-node engines."""
    def __init__(self, num_nodes):
        # Dynamo components
        self.router = DynamoRouter()
        self.planner = DynamoPlanner()
        # SGLang instances (one per node)
        self.sglang_engines = {}
        for node_id in range(num_nodes):
            engine = SGLangEngine(
                model="meta-llama/Llama-3-70B",
                tp_size=8,
                enable_radix_attention=True,
            )
            self.sglang_engines[node_id] = engine
            self.router.register_worker(node_id, engine)
        # Cache synchronization: SGLang reports cache state to Dynamo
        self._start_cache_sync()

    def handle_request(self, request):
        """Route via Dynamo, execute via SGLang."""
        # Step 1: Dynamo routes to the best node
        target_node = self.router.route(request)
        # Step 2: SGLang on that node handles execution;
        # its RadixAttention provides fine-grained prefix sharing
        engine = self.sglang_engines[target_node]
        return engine.generate(request)

    def _start_cache_sync(self):
        """Periodically sync SGLang cache state to the Dynamo router."""
        def sync_loop():
            while True:
                for node_id, engine in self.sglang_engines.items():
                    # SGLang reports its cached prefix hashes
                    cached_prefixes = engine.get_cached_prefix_hashes()
                    self.router.cache_index.update_worker_cache(
                        node_id, cached_prefixes
                    )
                time.sleep(2)  # Sync every 2 seconds
        threading.Thread(target=sync_loop, daemon=True).start()
```
### The Cache Sync Protocol
SGLang must expose its internal radix tree state to Dynamo. This is done via a lightweight export:
```python
class SGLangCacheExporter:
    """Export SGLang's RadixAttention cache state for Dynamo."""
    def __init__(self, sglang_engine):
        self.engine = sglang_engine

    def export_cache_hashes(self):
        """Export cached prefix hashes and their sizes."""
        radix_tree = self.engine.get_radix_tree()
        exports = {}

        def walk(node, prefix_tokens):
            if node.kv_cache_block_id is not None:
                prefix_hash = hash_tokens(prefix_tokens)
                exports[prefix_hash] = len(prefix_tokens)
            for token_id, child in node.children.items():
                walk(child, prefix_tokens + [token_id])

        walk(radix_tree.root, [])
        # Limit exports to the 100 longest prefixes, using length as a
        # proxy for value, so the sync payload stays small
        sorted_exports = sorted(
            exports.items(),
            key=lambda x: x[1],  # By prefix length
            reverse=True,
        )[:100]
        return dict(sorted_exports)
```
### Routing Feedback Loop
When Dynamo routes a request to node A because of a cache hit, but the request actually gets a cache miss (because the cache was evicted between sync updates), the system should learn:
```python
import time

class RoutingFeedback:
    """Track routing decisions and actual cache hit outcomes."""
    def __init__(self):
        self.decisions = []
        self.hits = 0
        self.misses = 0

    def record_decision(self, request_id, target_node, expected_overlap):
        self.decisions.append({
            'request_id': request_id,
            'target_node': target_node,
            'expected_overlap': expected_overlap,
            'timestamp': time.time(),
        })

    def record_outcome(self, request_id, actual_overlap):
        for d in reversed(self.decisions):
            if d['request_id'] == request_id:
                if actual_overlap >= d['expected_overlap'] * 0.8:
                    self.hits += 1
                else:
                    self.misses += 1
                d['actual_overlap'] = actual_overlap
                break

    @property
    def routing_accuracy(self):
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0

    def get_sync_interval_recommendation(self):
        """Recommend cache sync interval based on routing accuracy."""
        accuracy = self.routing_accuracy
        if accuracy > 0.95:
            return 5.0  # Slow sync is fine
        elif accuracy > 0.85:
            return 2.0  # Default
        elif accuracy > 0.70:
            return 1.0  # Need fresher data
        else:
            return 0.5  # Very aggressive sync
```
## Performance Comparison Benchmarks

### Benchmark Setup
```python
import random

class RouterBenchmark:
    """Benchmark routing strategies."""
    def __init__(self, num_nodes, num_prompts, total_requests):
        self.num_nodes = num_nodes
        self.num_prompts = num_prompts
        self.total_requests = total_requests

    def generate_workload(self):
        """Generate a realistic workload."""
        requests = []
        # Zipf distribution: some prompts are much more common
        prompt_weights = [1.0 / (i + 1) for i in range(self.num_prompts)]
        total_weight = sum(prompt_weights)
        prompt_probs = [w / total_weight for w in prompt_weights]
        for _ in range(self.total_requests):
            prompt_id = random.choices(range(self.num_prompts), weights=prompt_probs)[0]
            requests.append({
                'prompt_id': prompt_id,
                'prompt_length': random.randint(1000, 4000),
                'output_length': random.randint(100, 500),
            })
        return requests

    def simulate_round_robin(self, requests):
        """Baseline: round-robin routing."""
        node_caches = [set() for _ in range(self.num_nodes)]
        cache_hits = 0
        for i, req in enumerate(requests):
            node = i % self.num_nodes
            if req['prompt_id'] in node_caches[node]:
                cache_hits += 1
            else:
                node_caches[node].add(req['prompt_id'])
        return cache_hits / len(requests)

    def simulate_dynamo_routing(self, requests):
        """Dynamo: KV-aware routing."""
        node_caches = [set() for _ in range(self.num_nodes)]
        cache_hits = 0
        for req in requests:
            # Find a node with this prompt cached
            best_node = None
            for n in range(self.num_nodes):
                if req['prompt_id'] in node_caches[n]:
                    best_node = n
                    break
            if best_node is not None:
                cache_hits += 1
            else:
                # Route to the least-loaded node
                best_node = min(range(self.num_nodes), key=lambda n: len(node_caches[n]))
                node_caches[best_node].add(req['prompt_id'])
        return cache_hits / len(requests)
```
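Running the two strategies side by side makes the gap concrete. This is a condensed, self-contained version of the benchmark above, with a seeded RNG so the comparison is reproducible:

```python
import random

def zipf_workload(num_prompts, total_requests, seed=0):
    """Zipf-weighted prompt IDs: low IDs are much more common."""
    rng = random.Random(seed)
    weights = [1.0 / (i + 1) for i in range(num_prompts)]
    return [rng.choices(range(num_prompts), weights=weights)[0]
            for _ in range(total_requests)]

def round_robin_hits(prompts, num_nodes):
    """Round-robin: each popular prompt must warm every node's cache."""
    caches = [set() for _ in range(num_nodes)]
    hits = 0
    for i, p in enumerate(prompts):
        node = i % num_nodes
        hits += p in caches[node]
        caches[node].add(p)
    return hits / len(prompts)

def kv_aware_hits(prompts, num_nodes):
    """KV-aware: each distinct prompt misses exactly once cluster-wide."""
    caches = [set() for _ in range(num_nodes)]
    hits = 0
    for p in prompts:
        node = next((n for n in range(num_nodes) if p in caches[n]), None)
        if node is None:  # miss: place on the least-loaded node
            node = min(range(num_nodes), key=lambda n: len(caches[n]))
        else:
            hits += 1
        caches[node].add(p)
    return hits / len(prompts)

w = zipf_workload(num_prompts=50, total_requests=10_000)
print(f"round-robin: {round_robin_hits(w, 4):.1%}")
print(f"kv-aware:    {kv_aware_hits(w, 4):.1%}")
```

This toy model ignores eviction, so it overstates both hit rates, but the ordering holds: KV-aware routing pays at most one miss per distinct prompt, while round-robin pays up to one per prompt per node.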
[Figure: Cache Hit Rate, Round-Robin vs KV-Aware Routing (4 nodes, 50 prompts, 10K requests), in % cache hit rate]

## Integration Implementation
```python
import threading
import time
from types import SimpleNamespace

class DynamoSGLangIntegration:
    """Complete integration of Dynamo routing with SGLang engines."""
    def __init__(self, config):
        # Initialize SGLang engines
        self.engines = {}
        for node in config.nodes:
            self.engines[node.id] = self._start_sglang(node)
        # Initialize Dynamo router
        self.cache_index = ClusterCacheIndex()
        self.router = StalenessAwareRouter(self.cache_index)
        self.feedback = RoutingFeedback()
        # Start background tasks
        self._start_sync()
        self._start_metrics()

    def generate(self, request):
        """Handle a generation request."""
        # Hash the prompt for routing
        prompt_hashes = hash_prefix_chain(
            request.prompt_tokens, block_size=16
        )
        # Route via Dynamo; wrap engine ids in lightweight objects
        # exposing the .id attribute the cache index expects
        workers = [SimpleNamespace(id=w) for w in self.engines]
        target, expected_overlap = self.cache_index.find_best_worker(
            prompt_hashes, workers
        )
        self.feedback.record_decision(
            request.id, target.id, expected_overlap
        )
        # Execute via SGLang
        engine = self.engines[target.id]
        result = engine.generate(
            prompt=request.prompt_tokens,
            sampling_params=request.sampling_params,
        )
        # Record outcome
        actual_overlap = result.get('prefix_cache_hit_tokens', 0) / len(request.prompt_tokens)
        self.feedback.record_outcome(request.id, actual_overlap)
        return result

    def _start_sglang(self, node_config):
        """Start an SGLang engine on a node."""
        return SGLangEngine(
            model=node_config.model_name,
            tp_size=node_config.tp_size,
            port=node_config.port,
        )

    def _start_sync(self):
        """Start cache state synchronization."""
        def sync():
            while True:
                interval = self.feedback.get_sync_interval_recommendation()
                for node_id, engine in self.engines.items():
                    try:
                        cached = engine.get_cached_prefix_hashes()
                        self.cache_index.update_worker_cache(node_id, cached)
                    except Exception as e:
                        print(f"Cache sync failed for node {node_id}: {e}")
                time.sleep(interval)
        threading.Thread(target=sync, daemon=True).start()

    def _start_metrics(self):
        """Collect and report integration metrics."""
        def metrics():
            while True:
                print(f"Routing accuracy: {self.feedback.routing_accuracy:.2%}")
                time.sleep(30)
        threading.Thread(target=metrics, daemon=True).start()

    def get_status(self):
        return {
            'num_engines': len(self.engines),
            'routing_accuracy': self.feedback.routing_accuracy,
            'cache_index_size': len(self.cache_index.index),
        }
```
## When to Use What

```python
def recommend_architecture(workload):
    """Recommend SGLang, Dynamo, or both based on workload."""
    if workload.num_nodes == 1:
        return {
            'recommendation': 'SGLang only',
            'reason': 'Single node -- Dynamo adds overhead without benefit',
        }
    if workload.num_nodes > 1 and workload.prefix_diversity < 10:
        return {
            'recommendation': 'Dynamo + SGLang',
            'reason': (
                'Multi-node with shared prefixes -- '
                'Dynamo routes by prefix, SGLang shares within node'
            ),
        }
    if workload.num_nodes > 4 and workload.needs_disaggregation:
        return {
            'recommendation': 'Dynamo + SGLang',
            'reason': (
                'Large cluster with disaggregated prefill/decode -- '
                'Dynamo manages pools, SGLang runs per-node'
            ),
        }
    if workload.prefix_diversity > 100:
        return {
            'recommendation': 'Dynamo + simple engine',
            'reason': (
                'High prefix diversity -- RadixAttention hit rate is low, '
                'Dynamo routing provides more value'
            ),
        }
    return {
        'recommendation': 'Dynamo + SGLang',
        'reason': 'Default: both systems provide complementary benefits',
    }
```
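A quick usage sketch of the recommender, condensed to return just the recommendation string; the workload objects are hypothetical stand-ins built with `SimpleNamespace` to match the fields the function reads:

```python
from types import SimpleNamespace

def recommend_architecture(workload):
    """Condensed copy of the decision rules above (illustration only)."""
    if workload.num_nodes == 1:
        return 'SGLang only'
    if workload.num_nodes > 1 and workload.prefix_diversity < 10:
        return 'Dynamo + SGLang'
    if workload.num_nodes > 4 and workload.needs_disaggregation:
        return 'Dynamo + SGLang'
    if workload.prefix_diversity > 100:
        return 'Dynamo + simple engine'
    return 'Dynamo + SGLang'

single_node = SimpleNamespace(num_nodes=1, prefix_diversity=3,
                              needs_disaggregation=False)
diverse = SimpleNamespace(num_nodes=8, prefix_diversity=500,
                          needs_disaggregation=False)
print(recommend_architecture(single_node))  # SGLang only
print(recommend_architecture(diverse))      # Dynamo + simple engine
```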