Part of Series: NVIDIA Dynamo & llm-d — Part 3 of 30

1. NVIDIA Dynamo: KV-Aware Routing and the Inference Operating System for GPU Clusters
2. NVIDIA Dynamo Part 2: ModelExpress, NIXL, and Zero-Instruction Cold Starts
3. NVIDIA Dynamo Part 3: The Planner, Grove Operator, and Gang Scheduling on NVL72
4. NVIDIA Dynamo Part 4: KVBM — Multi-Tier KV Cache Offloading Across GPU, CPU, SSD, and Remote
5. llm-d: Declarative Inference Configuration — From YAML to Optimized GPU Execution
6. Dynamo Fault Tolerance: Canary Health Checks, Request Migration, and Graceful Degradation
7. Dynamo Multi-Model Serving: GPU Sharing, Model Priority, and Adapter Pool Management
8. Dynamo for Multimodal: Video/Audio Routing and Encoder Scheduling
9. Dynamo Cost Optimizer: Spot Instances, Reserved Capacity, and Burst Strategy
10. Dynamo on Blackwell: GB200 NVL72 Architecture and Inference Integration
11. Dynamo Observability: Distributed Tracing, Metrics, and Latency Alerting
12. Dynamo vs SGLang Router: Architectural Comparison and Integration Patterns
13. Dynamo for MoE: Expert-Aware Routing and Expert Parallelism Integration
14. Building a Mini-Dynamo: A 500-Line Python KV-Aware Router
15. Dynamo Request Lifecycle: End-to-End Trace from HTTP to GPU Kernel with Latency Breakdown
16. Dynamo Capacity Planning: How Many GPUs for Your SLO, Traffic Pattern, and Model Size
17. Migrating from Single-Node vLLM to Dynamo: A Step-by-Step Production Guide
18. Dynamo Security and Isolation: Multi-Tenant Serving, Request Isolation, and Data Privacy
19. Dynamo A/B Testing and Canary Deployments: Safe Model Updates Without Downtime
20. Dynamo Production Monitoring: Grafana Dashboards, Alert Playbooks, and On-Call Guide
21. Dynamo Network Optimization: InfiniBand Tuning, NCCL Parameters, and Cross-Rack Communication
22. Dynamo for Edge: Extending Cluster Orchestration to On-Premise and Hybrid Deployments
23. Dynamo Batch Inference: Offline Processing and Maximum Throughput
24. Dynamo Speculative Decoding: Draft-Target Coordination Across a Cluster
25. Dynamo Model Versioning: Blue-Green Deployment and Safe Rollback
26. Dynamo GPU Health: DCGM Integration and Predictive Maintenance
27. Load Testing Dynamo: Finding Your Cluster's Breaking Point
28. Dynamo Multi-Tenant Isolation: Ensuring Data Privacy Across Shared GPU Clusters
29. Dynamo Cost-Per-Token Optimization: Minimizing Serving Cost While Meeting SLOs
30. Dynamo Roadmap: What's Coming in 2026 — CXL Integration, NVLink Switch, and Beyond

Deploying a tensor-parallel model across 8 GPUs is easy when you have 8 GPUs in one server. But what happens when you’re managing 1,000 GPUs spread across 125 servers in 5 racks? Kubernetes will happily scatter your 8-GPU TP group across 8 different nodes connected by InfiniBand, giving you 36x slower all-reduce compared to GPUs in the same NVLink domain. Worse: when you scale up to 10,000 GPUs, the scheduling problem becomes coordinating hundreds of TP groups across a topology where not all GPU pairs have equal connectivity. Two GPUs in the same tray share 1.8 TB/s NVLink. Two GPUs across NVSwitch domains drop to 600 GB/s. Two GPUs across InfiniBand drop to 50 GB/s. Dynamo’s Planner and the Grove Kubernetes operator solve this by modeling the GPU topology as a graph, computing placement costs for every candidate GPU group, and implementing gang scheduling that guarantees your TP group lands on well-connected hardware.

A GB200 NVL72 rack contains 72 Blackwell GPUs connected by NVLink 5.0. Not all GPU pairs have equal connectivity: adjacent GPUs share direct NVLink at 1.8 TB/s, while GPUs across NVSwitch domains have lower effective bandwidth (~600 GB/s). Placing a tensor-parallel group on poorly-connected GPUs can degrade all-reduce latency by 3x.

Dynamo’s Planner and the Grove Kubernetes operator solve this: they understand the GPU topology and make placement decisions that respect connectivity constraints. This post covers the topology model, the cost function, and the gang scheduling algorithm.

The NVL72 Topology

GB200 NVL72 Connectivity Hierarchy

| Tier | Scope | Bandwidth | Latency |
| --- | --- | --- | --- |
| Intra-tray (direct NVLink 5.0) | 2 GPUs | 1,800 GB/s bidirectional | ~0.5 us |
| Intra-rack (via NVSwitch) | Up to 72 GPUs | ~600 GB/s effective cross-domain | 1-2 us |
| Inter-rack (InfiniBand NDR) | Across racks | ~50 GB/s | 5-10 us |
| Cross-datacenter (WAN) | Remote clusters | 1-10 GB/s | 1-50 ms |

The Planner maintains a topology graph where each GPU is a node and each link has measured bandwidth and latency:

from collections import namedtuple

# Minimal record types for the graph's nodes and links
GPUNode = namedtuple("GPUNode", ["gpu_id", "rack_id", "tray_id", "domain_id"])
LinkInfo = namedtuple("LinkInfo", ["bandwidth_gbps", "latency_us"])

class TopologyGraph:
    def __init__(self):
        self.gpus = {}           # gpu_id -> GPUNode
        self.links = {}          # (gpu_a, gpu_b) -> LinkInfo
        self.domains = {}        # domain_id -> set of gpu_ids

    def add_gpu(self, gpu_id, rack_id, tray_id, domain_id):
        self.gpus[gpu_id] = GPUNode(gpu_id, rack_id, tray_id, domain_id)
        self.domains.setdefault(domain_id, set()).add(gpu_id)

    def add_link(self, gpu_a, gpu_b, bandwidth_gbps, latency_us):
        self.links[(gpu_a, gpu_b)] = LinkInfo(bandwidth_gbps, latency_us)
        self.links[(gpu_b, gpu_a)] = LinkInfo(bandwidth_gbps, latency_us)

    def all_reduce_cost(self, gpu_group, message_size_gb=0.002):
        """Estimate ring all-reduce latency (us) for a TP group on these GPUs.

        Default message size is 2 MB (0.002 GB), matching the examples below.
        """
        # Ring all-reduce: each GPU sends 2*(N-1) chunks of size S/N,
        # so every link carries 2*(N-1)/N * S bytes.
        # The slowest link in the ring is the bottleneck.
        n = len(gpu_group)
        min_bw = float("inf")
        max_latency = 0
        for i in range(n):
            j = (i + 1) % n
            link = self.links.get((gpu_group[i], gpu_group[j]))
            if link is None:
                return float("inf")  # No direct connection in the ring
            min_bw = min(min_bw, link.bandwidth_gbps)
            max_latency = max(max_latency, link.latency_us)

        # Bandwidth term plus 2*(N-1) latency-bound ring steps
        volume_gb = 2 * (n - 1) / n * message_size_gb
        transfer_us = volume_gb / min_bw * 1e6
        return transfer_us + 2 * (n - 1) * max_latency

The Planner’s Cost Function

For each incoming request or model deployment, the Planner computes a placement cost:

\text{Cost}(M, G) = w_1 \cdot T_{\text{allreduce}}(G) + w_2 \cdot T_{\text{kv\_transfer}}(G) + w_3 \cdot Q_{\text{depth}}(G) + w_4 \cdot (1 - \text{cache\_hit}(G))

Where:

  • M = model configuration (TP degree, PP stages)
  • G = candidate GPU group
  • T_{\text{allreduce}}(G) = estimated all-reduce time based on GPU connectivity
  • T_{\text{kv\_transfer}}(G) = estimated KV cache transfer time from prefill to decode GPUs
  • Q_{\text{depth}}(G) = current queue depth on the candidate GPUs
  • \text{cache\_hit}(G) = fraction of the request's KV cache already present on the candidate GPUs
  • w_1 \ldots w_4 = tunable weights balancing the four terms

The Planner minimizes this cost across all feasible GPU groups.
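The weighted sum is easy to sketch in code. The weights and per-group estimates below are illustrative assumptions for demonstration, not Dynamo's tuned values:

```python
# Illustrative sketch of the Planner's weighted placement cost.
# Weights (w1..w4) and the candidate inputs are assumptions, not
# Dynamo's actual configuration.

def placement_cost(t_allreduce_ms, t_kv_transfer_ms, queue_depth, cache_hit,
                   w1=1.0, w2=0.5, w3=0.1, w4=0.2):
    """Cost(M, G) = w1*T_allreduce + w2*T_kv + w3*Q_depth + w4*(1 - cache_hit)."""
    return (w1 * t_allreduce_ms
            + w2 * t_kv_transfer_ms
            + w3 * queue_depth
            + w4 * (1 - cache_hit))

# Two hypothetical candidate groups for the same request:
same_tray  = placement_cost(0.016, 0.1, queue_depth=2, cache_hit=0.4)
cross_rack = placement_cost(0.576, 0.8, queue_depth=1, cache_hit=0.0)
```

Even though the cross-rack group has a shorter queue, its all-reduce and KV-transfer terms dominate, so the same-tray group wins; this is the same trade-off the examples below illustrate.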

📊 Placement Cost Examples (Llama 70B, TP=8)

| GPU Group | All-Reduce BW | All-Reduce Time (2 MB msg) | Queue Depth | Total Cost |
| --- | --- | --- | --- | --- |
| 8 GPUs, same tray | 1,800 GB/s | 0.016 ms | 2 requests | Low |
| 8 GPUs, same rack/domain | 600 GB/s | 0.048 ms | 5 requests | Medium |
| 8 GPUs, cross-rack (IB) | 50 GB/s | 0.576 ms | 1 request | High (despite low queue) |
Note: All-reduce time dominates the cost function for TP workloads. Cross-rack TP is 36x slower than same-tray TP.
⚠️ Cross-Rack TP Is Almost Never Worth It

Tensor parallelism performs 2 all-reduce operations per transformer layer during inference: one after the attention block and one after the FFN. At 80 layers, that is 160 all-reduces per forward pass. Cross-rack adds ~0.56 ms per all-reduce, roughly 89.6 ms of total overhead; same-tray adds only ~2.56 ms. The Planner avoids cross-rack TP placements unless no intra-rack options are available.
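This arithmetic is easy to check, using the table's 0.576 ms cross-rack figure (the warning above rounds it to 0.56 ms):

```python
# Back-of-envelope check of cross-rack vs same-tray TP overhead.
# Per-all-reduce times are the table's estimates for a 2 MB message.
layers = 80
allreduces_per_layer = 2                      # one after attention, one after FFN
n_allreduce = layers * allreduces_per_layer   # 160 per forward pass

t_same_tray_ms = 0.016    # 1,800 GB/s NVLink, same tray
t_cross_rack_ms = 0.576   # 50 GB/s InfiniBand, cross-rack

overhead_tray = n_allreduce * t_same_tray_ms     # ~2.56 ms per forward pass
overhead_cross = n_allreduce * t_cross_rack_ms   # ~92 ms per forward pass
slowdown = t_cross_rack_ms / t_same_tray_ms      # 36x per all-reduce
```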

Gang Scheduling

A tensor-parallel group MUST be scheduled atomically: all N GPUs start the model simultaneously, or none do. This is gang scheduling.

from itertools import combinations

def gang_schedule(model_config, topology, available_gpus):
    """Find the best GPU group for a TP model deployment."""
    tp_degree = model_config.tp_degree  # e.g., 8

    # Generate candidate groups: prefer GPUs from the same NVSwitch domain.
    # (A production planner prunes this enumeration; scoring all C(72, 8)
    # combinations exhaustively would be far too expensive.)
    candidates = []
    for domain_id, domain_gpus in topology.domains.items():
        free_gpus = [g for g in domain_gpus if g in available_gpus]
        if len(free_gpus) >= tp_degree:
            for group in combinations(free_gpus, tp_degree):
                cost = compute_placement_cost(model_config, group, topology)
                candidates.append((cost, group))

    if not candidates:
        # Fall back to cross-domain placement
        all_free = list(available_gpus)
        if len(all_free) >= tp_degree:
            for group in combinations(all_free, tp_degree):
                cost = compute_placement_cost(model_config, group, topology)
                candidates.append((cost, group))

    if not candidates:
        return None  # Cannot schedule

    # Select the lowest-cost group
    candidates.sort(key=lambda x: x[0])
    best_cost, best_group = candidates[0]

    # Atomically reserve all GPUs in the group: all N or none
    for gpu in best_group:
        available_gpus.remove(gpu)

    return best_group
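The reservation loop above assumes it runs under a single scheduler lock. A minimal sketch of the all-or-nothing semantics gang scheduling requires (class and method names here are illustrative, not Dynamo's API):

```python
# Sketch of atomic gang reservation: either every GPU in the group is
# reserved, or none are. Names are illustrative assumptions.
import threading

class GangReservation:
    def __init__(self, gpu_ids):
        self.available = set(gpu_ids)
        self.lock = threading.Lock()

    def reserve(self, group):
        """Reserve the whole group atomically; change nothing on failure."""
        with self.lock:
            if not set(group) <= self.available:
                return False  # a partial gang is never reserved
            self.available -= set(group)
            return True

    def release(self, group):
        """Return a gang's GPUs to the pool."""
        with self.lock:
            self.available |= set(group)
```

A failed reservation leaves the pool untouched, so a competing deployment never observes a half-reserved gang.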

The Grove Kubernetes Operator

Grove translates Dynamo’s scheduling decisions into Kubernetes pod specifications. It creates a custom resource definition (CRD) for inference deployments:

apiVersion: dynamo.nvidia.com/v1
kind: InferenceDeployment
metadata:
  name: llama-70b-serving
spec:
  model:
    name: meta-llama/Llama-3-70B
    tensorParallelism: 8
    pipelineParallelism: 1
  placement:
    affinityPolicy: "same-nvswitch-domain"
    antiAffinityPolicy: "spread-across-racks"
  scaling:
    minReplicas: 2
    maxReplicas: 16
    targetLatency:
      ttft_p99_ms: 500
      tbt_p99_ms: 50
    scaleUpTrigger:
      queueDepth: 10
    scaleDownDelay: 300s
  coldStart:
    strategy: "modelexpress"  # Use GPU-to-GPU streaming
    sourceReplica: "any-healthy"

Grove reads the topology from NVIDIA’s nv-topology-exporter DaemonSet, which reports NVLink connectivity and bandwidth for every GPU pair. It then uses the Planner’s cost function to translate affinityPolicy: "same-nvswitch-domain" into concrete pod-to-node assignments.
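A pod spec produced by such a translation might look something like the sketch below. The gang label, the `nvidia.com/nvswitch-domain` topology key, and the exact affinity encoding are assumptions for illustration, not Grove's actual output:

```yaml
# Hypothetical pod fragment for one rank of a TP=8 gang.
# Label keys and values are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: llama-70b-serving-replica-0-rank-0
  labels:
    dynamo.nvidia.com/gang: llama-70b-serving-replica-0
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              dynamo.nvidia.com/gang: llama-70b-serving-replica-0
          # Co-locate all ranks of the gang on nodes sharing one
          # NVSwitch-domain label reported by the topology exporter.
          topologyKey: nvidia.com/nvswitch-domain
  containers:
    - name: worker
      resources:
        limits:
          nvidia.com/gpu: 1
```

The `topologyKey` is what turns `affinityPolicy: "same-nvswitch-domain"` into standard Kubernetes inter-pod affinity: every rank must land on a node carrying the same domain label as its gang-mates.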

ℹ️ Why Kubernetes Alone Is Not Enough

Standard Kubernetes scheduling assigns pods to nodes based on CPU/memory availability. It has no concept of NVLink topology, GPU interconnect bandwidth, or KV cache locality. Grove extends the scheduler with GPU-topology-aware placement, ensuring TP groups land on well-connected GPUs. Without Grove, a TP=8 deployment might spread across 4 nodes with InfiniBand links, degrading performance 36x.