Expert parallelism is the distribution strategy that makes MoE serving possible at scale, and the all-to-all collective is the operation that makes it expensive. In a DeepSeek V3 deployment with 256 experts across 64 GPUs, every MoE layer requires every GPU to exchange tokens with every other GPU — twice: once to dispatch tokens to the experts that will process them, and once to combine the results back. The arithmetic is unforgiving: at the batch sizes and hidden dimensions used in production, a single forward pass generates more than 50 GB of cross-GPU traffic. This post is about exactly where those bytes go, how fast they get there, and the systems engineering that hides that latency.

1. The Expert Parallelism Communication Pattern

Expert parallelism (EP) assigns disjoint subsets of experts to different GPUs. With 256 experts and 64 GPUs, each GPU hosts 4 experts. When a token is routed to an expert that lives on a remote GPU, that token’s hidden state must be sent there. After the expert computes its output, the result must be sent back.

The communication primitive is all-to-all: each GPU sends a potentially different-sized payload to every other GPU, and receives a potentially different-sized payload from every other GPU. This is fundamentally different from all-reduce (where every GPU sends the same data) or all-gather (where every GPU receives the full dataset). All-to-all is an asymmetric, data-dependent operation — the volume each GPU sends to each destination depends on the router’s decisions for the current batch.

Quantifying the Volume

For a single MoE layer’s dispatch operation:

V_{\text{dispatch}} = B \times k \times D \times \text{bytes\_per\_element}

where $B$ is the batch size (total tokens across all sequences), $k$ is the top-K experts per token, and $D$ is the hidden dimension. Each token is sent to $k$ experts, and the payload is the full hidden state vector.

For DeepSeek V3 parameters:

  • $B = 4096$ tokens (a typical serving batch)
  • $k = 8$ (top-8 routing from 256 experts)
  • $D = 7168$ (hidden dimension)
  • FP16: 2 bytes per element

V_{\text{dispatch}} = 4096 \times 8 \times 7168 \times 2 = 469{,}762{,}048 \text{ bytes} \approx 449 \text{ MB}

The combine operation sends the expert outputs back to the originating GPUs. Same volume: 449 MB. Total per-layer communication:

V_{\text{layer}} = 2 \times V_{\text{dispatch}} \approx 898 \text{ MB}
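As a sanity check, the per-layer volume arithmetic can be reproduced in a few throwaway lines of Python, using the parameters listed above:

```python
# Per-layer all-to-all volume for the DeepSeek V3 serving configuration above.
B, k, D = 4096, 8, 7168      # batch tokens, top-K, hidden dimension
BYTES_PER_ELEMENT = 2        # FP16

v_dispatch = B * k * D * BYTES_PER_ELEMENT   # tokens -> expert GPUs
v_layer = 2 * v_dispatch                     # dispatch + combine

print(f"dispatch:  {v_dispatch:,} bytes ({v_dispatch / 2**20:.0f} MiB)")
print(f"per layer: {v_layer / 2**20:.0f} MiB")
```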

DeepSeek V3 has 61 transformer layers; the first 3 use dense FFNs, and the remaining 58 use MoE routing. Across all MoE layers:

V_{\text{forward}} = 58 \times 898 \text{ MB} \approx 52 \text{ GB}

For a full forward pass, roughly 52 GB of hidden states cross GPU boundaries. During training (forward + backward), this roughly doubles.

🚨 All-to-All Is Not Reducible

Unlike all-reduce, which has well-known bandwidth-optimal algorithms (ring, recursive halving-doubling) that achieve $2 \times \frac{N-1}{N} \times \frac{M}{BW}$ runtime for a message of size $M$ over per-link bandwidth $BW$ on $N$ ranks, all-to-all admits no such reduction. Every byte must physically traverse the interconnect; nothing can be compressed or aggregated in transit. The only optimization levers are: overlap with computation, faster interconnects, or reduced volume (smaller hidden dim, fewer active experts, quantized dispatch).
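The runtime models can be written out directly; the function names and the 25 GB/s inter-node figure below are illustrative:

```python
def ring_allreduce_time(M, BW, N):
    """Bandwidth-optimal all-reduce: each byte crosses a link 2*(N-1)/N times."""
    return 2 * (N - 1) / N * M / BW

def alltoall_time_lower_bound(M, BW):
    """All-to-all: every byte of the payload must traverse the bottleneck link."""
    return M / BW

M, BW, N = 449e6, 25e9, 64   # 449 MB payload, 25 GB/s per GPU, 64 ranks
print(f"all-reduce bound: {ring_allreduce_time(M, BW, N) * 1e3:.1f} ms")
print(f"all-to-all bound: {alltoall_time_lower_bound(M, BW) * 1e3:.1f} ms")
```

The two bounds describe different collectives over different payload semantics; the point is that the all-to-all bound cannot be improved algorithmically, only by shrinking $M$ or raising $BW$.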

The Communication Matrix

The all-to-all is not uniform. Different GPUs send different volumes to different destinations depending on which experts each token selects. Consider 64 GPUs, each holding 4 experts. The communication pattern forms a $64 \times 64$ matrix where entry $(i, j)$ is the number of bytes GPU $i$ sends to GPU $j$:

Communication Matrix C[i][j] (bytes):
       GPU_0    GPU_1    GPU_2    ...   GPU_63
GPU_0  [local]  1.2 MB   0.8 MB  ...   0.6 MB
GPU_1  0.9 MB   [local]  1.4 MB  ...   0.7 MB
GPU_2  1.1 MB   0.5 MB   [local] ...   1.3 MB
...
GPU_63 0.7 MB   1.0 MB   0.9 MB  ...   [local]

The diagonal entries are “local” — tokens routed to experts on their home GPU require no communication. Off-diagonal entries vary based on routing decisions. A well-balanced router produces roughly uniform off-diagonal entries; a poorly balanced router creates hot spots where certain GPU pairs exchange disproportionately more data.

The total data GPU $i$ sends is:

S_i = \sum_{j \neq i} C[i][j]

And the total data GPU $i$ receives:

R_i = \sum_{j \neq i} C[j][i]

For the all-to-all to complete, every GPU must finish both sending and receiving. The GPU with the largest $S_i + R_i$ determines the wall-clock time. This is why load balancing at the router level has direct bandwidth implications — it is not just about compute balance but about communication balance.
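A toy sketch of this bottleneck computation (the matrix values, hot-spot placement, and helper name are invented for illustration):

```python
import random

def bottleneck_rank(C):
    """Given C[i][j] = bytes GPU i sends to GPU j (diagonal = local traffic,
    excluded), return the rank whose send+receive volume gates the all-to-all."""
    n = len(C)
    totals = []
    for i in range(n):
        s_i = sum(C[i][j] for j in range(n) if j != i)   # bytes sent
        r_i = sum(C[j][i] for j in range(n) if j != i)   # bytes received
        totals.append(s_i + r_i)
    worst = max(range(n), key=lambda i: totals[i])
    return worst, totals[worst]

random.seed(0)
# Roughly balanced router: ~1 MB per off-diagonal entry on an 8-GPU group
C = [[0 if i == j else random.randint(800_000, 1_200_000) for j in range(8)]
     for i in range(8)]
C[2][5] = 10_000_000   # imbalanced routing creates a GPU 2 -> GPU 5 hot spot

rank, volume = bottleneck_rank(C)
print(f"bottleneck GPU: {rank}, S+R = {volume / 1e6:.1f} MB")
```

With a single hot pair, either the sender or the receiver of the hot flow becomes the straggler that every other rank waits on.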

📊

All-to-All Volume Per MoE Layer (DeepSeek V3 Config)

ParameterValueImpact
Batch size (B) 4096 tokens Linear scaling with batch
Top-K (k) 8 experts 8x replication of each token
Hidden dim (D) 7168 Dominates per-token payload
Dispatch volume 449 MB Per-layer, one direction
Combine volume 449 MB Per-layer, return direction
Total per layer 898 MB Dispatch + combine
Total forward pass (61 layers) ~53.5 GB All MoE layers combined
Note: FP16 precision. Actual volume varies with routing decisions. Local expert traffic excluded.

The all-to-all dispatch time depends entirely on the interconnect bandwidth between GPUs. Modern GPU clusters have two fundamentally different interconnect tiers, and the gap between them defines EP system design.

2. The Interconnect Hierarchy

Intra-Node: NVLink

Within a single DGX H100 node (8 GPUs), NVLink 4.0 provides 900 GB/s bidirectional bandwidth between any pair of GPUs. The NVSwitch fabric makes this a full bisection bandwidth topology — all 8 GPUs can communicate simultaneously at full rate without contention.

Time to dispatch 449 MB over NVLink:

T_{\text{NVLink}} = \frac{449 \text{ MB}}{900 \text{ GB/s}} \approx 0.5 \text{ ms}

This is fast. At 0.5 ms per dispatch + 0.5 ms per combine = 1.0 ms per MoE layer for intra-node communication. Across 58 MoE layers: about 58 ms total. Manageable.

Inter-Node: InfiniBand RDMA

Between nodes, InfiniBand HDR provides 200 Gb/s (25 GB/s) per port. Even with 8 NICs per node (200 GB/s aggregate), each GPU's share of inter-node bandwidth is at best 25 GB/s, a small fraction of its NVLink bandwidth:

T_{\text{RDMA}} = \frac{449 \text{ MB}}{25 \text{ GB/s}} \approx 18 \text{ ms}

That is a 36x slowdown compared to NVLink.

All-to-All Dispatch Latency by Interconnect (449 MB payload)

NVLink 4.0 (H100), intra-node     0.5 ms
NVLink 3.0 (A100), intra-node     0.75 ms   (1.5x slower than NVLink 4.0)
IB HDR (200 Gb/s), inter-node     18 ms     (36x)
IB NDR (400 Gb/s), inter-node     9 ms      (18x)
RoCE (100 Gb/s), Ethernet RDMA    36 ms     (72x)
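These latencies follow directly from payload / bandwidth; a quick sketch using the theoretical peak bandwidths (decimal units):

```python
# Dispatch latency for one layer's 449 MB payload across interconnect tiers.
PAYLOAD = 449e6   # bytes

bandwidth_gbs = {            # theoretical peak, GB/s
    "NVLink 4.0 (H100)": 900,
    "NVLink 3.0 (A100)": 600,
    "IB NDR (400 Gb/s)": 50,
    "IB HDR (200 Gb/s)": 25,
    "RoCE (100 Gb/s)":   12.5,
}

for link, bw in bandwidth_gbs.items():
    t_ms = PAYLOAD / (bw * 1e9) * 1e3
    print(f"{link:<20} {t_ms:6.2f} ms")
```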

Why This Gap Defines EP Placement

The 36x gap means that expert placement on GPUs is not arbitrary. Experts that are frequently co-activated by the same tokens should reside on GPUs within the same node. If token $t$ selects experts $e_1$ and $e_2$, and both live on GPUs within the same NVLink domain, the dispatch for that token costs 0.5 ms. If $e_1$ is local and $e_2$ is on a remote node, the dispatch for $e_2$ costs 18 ms — and the combine must wait for the slowest expert to finish.

This motivates hierarchical EP: split the all-to-all into two phases:

  1. Intra-node all-to-all over NVLink: redistribute tokens among the 8 GPUs within each node. Fast.
  2. Inter-node all-to-all over RDMA: exchange tokens between nodes for experts that do not have a local copy. Slow.

The communication cost is dominated by the inter-node phase. If the router can be biased to prefer experts within the local node (while maintaining load balance), the inter-node traffic is reduced. DeepSeek V3 uses a device-limited routing auxiliary loss to encourage exactly this behavior:

\mathcal{L}_{\text{device}} = \alpha \sum_{d=1}^{D} \left( f_d - \frac{1}{D} \right)^2

where $f_d$ is the fraction of tokens routed to experts on device $d$, and $D$ is (overloading the symbol) the number of devices. This loss penalizes routing distributions that send disproportionate traffic to any single device.
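The loss is simple enough to state as code; the coefficient value and function name here are illustrative:

```python
def device_balance_loss(token_fractions, alpha=0.01):
    """Penalize deviation of each device's routed-token fraction from the
    uniform share 1/D (alpha is an illustrative coefficient)."""
    D = len(token_fractions)
    return alpha * sum((f - 1.0 / D) ** 2 for f in token_fractions)

uniform = [1 / 8] * 8                                      # perfectly balanced
skewed = [0.30, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]  # one hot device

print(device_balance_loss(uniform))   # 0.0 -- no penalty
print(device_balance_loss(skewed))    # positive -- hot device penalized
```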

The NVLink Budget Rule

On an 8-GPU node with 4 experts per GPU and top-8 routing, the expected number of inter-node transfers per token is $k \times (1 - \frac{E_{\text{local}}}{E_{\text{total}}}) = 8 \times (1 - \frac{32}{256}) = 7$ in the worst case (completely uniform routing over all experts). If device-limited routing can bias 3 of the 8 selected experts to be local, inter-node transfers drop to 5 per token — a 29% reduction in the slowest communication path. The latency savings compound across 58 MoE layers.
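The expected-transfer arithmetic, as a small sketch (function name invented for illustration):

```python
def expected_remote_selections(k, local_experts, total_experts):
    """Expected off-node picks per token under uniform routing over all experts."""
    return k * (1 - local_experts / total_experts)

baseline = expected_remote_selections(8, 32, 256)  # 8 GPUs/node x 4 experts each
biased = 8 - 3   # device-limited routing keeps 3 of the 8 picks on-node

print(baseline)                                           # 7.0
print(f"reduction: {(baseline - biased) / baseline:.0%}")
```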

Bandwidth Utilization Reality

Theoretical peak bandwidth is rarely achieved. Protocol overhead (RDMA verb processing, completion queue polling), PCIe traversal between GPU HBM and the NIC, and network congestion all reduce effective throughput. Measured numbers on production clusters:

📊

Measured vs Theoretical Interconnect Bandwidth

Interconnect         Theoretical Peak   Measured (All-to-All)   Utilization
NVLink 4.0 (H100)    900 GB/s           810 GB/s                90%
NVLink 3.0 (A100)    600 GB/s           510 GB/s                85%
IB NDR (400 Gb/s)    50 GB/s            38 GB/s                 76%
IB HDR (200 Gb/s)    25 GB/s            18 GB/s                 72%
RoCE v2 (200 Gb/s)   25 GB/s            14 GB/s                 56%

Note: All-to-all utilization is lower than all-reduce due to asymmetric traffic patterns and small per-destination message sizes.

All-to-all achieves lower utilization than all-reduce because the per-destination message sizes are smaller. With 64 GPUs, a 449 MB dispatch is split into 63 individual transfers of ~7.1 MB each. Small messages cannot saturate the link as effectively as one large transfer.
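The per-destination fragmentation is easy to quantify (a quick sketch):

```python
# Each rank's 449 MB dispatch is split across the 63 other ranks.
payload = 449e6          # bytes: one layer's dispatch from one GPU
num_peers = 64 - 1
per_peer = payload / num_peers

print(f"~{per_peer / 1e6:.1f} MB per destination")
```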

3. DeepEP: DeepSeek’s Expert Parallelism Communication Library

DeepEP is DeepSeek’s open-source communication library purpose-built for MoE expert parallelism. It addresses the core problem: how to execute all-to-all dispatch and combine operations as fast as possible, ideally overlapping them with computation so the communication cost is hidden.

Architecture: Two Kernel Types

DeepEP provides two distinct communication kernel families, optimized for different deployment scenarios:

High-throughput kernels target the prefill phase, where batch sizes are large and the priority is maximizing aggregate throughput. These kernels saturate both NVLink and RDMA simultaneously by running intra-node and inter-node transfers concurrently:

# DeepEP high-throughput dispatch (prefill path)
# Saturates NVLink + RDMA simultaneously
import deep_ep

buffer = deep_ep.Buffer(
    group=ep_group,           # Process group: 64 GPUs
    num_nvl_bytes=4 * 1024**3,  # 4 GB NVLink buffer
    num_rdma_bytes=2 * 1024**3  # 2 GB RDMA buffer
)

# Dispatch: tokens -> remote expert GPUs
# Returns recv_tokens on each GPU (tokens assigned to local experts)
recv_tokens, recv_metadata, dispatch_handle = buffer.dispatch(
    hidden_states,           # [num_local_tokens, hidden_dim] on this GPU
    topk_idx,               # [num_local_tokens, k] -- expert indices
    topk_weights,           # [num_local_tokens, k] -- gate weights
    num_experts=256,
    use_fp8=True            # Quantize dispatch payload to FP8
)

Measured performance: 153 GB/s NVLink throughput + 50 GB/s RDMA throughput running simultaneously on H800 nodes. The NVLink path uses GPU-initiated copies through NVSwitch; the RDMA path uses GPUDirect RDMA to bypass CPU staging entirely.

Low-latency kernels target the decode phase, where batch sizes are small (often just 1 token per sequence) and per-token latency is the binding constraint. These kernels use a fundamentally different approach:

# DeepEP low-latency dispatch (decode path)
# RDMA runs via DMA engines -- zero SM consumption
recv_tokens, recv_metadata, dispatch_handle = buffer.dispatch(
    hidden_states,
    topk_idx,
    topk_weights,
    num_experts=256,
    is_low_latency=True     # Use RDMA DMA path, free SM occupancy
)

The critical property: low-latency kernels execute RDMA transfers using the NIC’s DMA engines, consuming zero GPU streaming multiprocessors (SMs). The GPU’s tensor cores and SMs are completely unblocked during the transfer. This means expert FFN computation from the previous layer can run concurrently with the dispatch for the current layer, with no resource contention on the GPU.

ℹ️ Why SM-Free RDMA Matters

Standard NCCL all-to-all operations consume SMs for protocol processing, memory copies, and synchronization. On an H100 with 132 SMs, NCCL can consume 4-8 SMs for a large all-to-all — that is 3-6% of compute capacity unavailable for expert FFN execution. DeepEP’s low-latency kernels push all transfer logic to the NIC’s DMA engine and the NVSwitch hardware, returning 100% of SMs to compute. For decode workloads where expert FFN computation is already small (small batch size), those 4-8 SMs represent a proportionally larger fraction of total compute.
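The SM opportunity cost from the note above, written out:

```python
# Fraction of an H100's 132 SMs consumed by SM-based communication kernels.
TOTAL_SMS = 132
for comm_sms in (4, 8):   # typical NCCL all-to-all SM footprint cited above
    frac = comm_sms / TOTAL_SMS
    print(f"{comm_sms} SMs -> {frac:.1%} of compute capacity unavailable")
```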

Hook-Based Overlap API

DeepEP implements communication-computation overlap through a hook-based API. The programmer registers communication operations that fire at specific points in the MoE forward pass, enabling the runtime to pipeline dispatch/combine with expert computation:

import torch
import torch.nn as nn
import torch.nn.functional as F
import deep_ep

class DeepEPMoELayer(nn.Module):
    def __init__(self, num_experts, hidden_dim, ep_group):
        super().__init__()
        self.buffer = deep_ep.Buffer(
            group=ep_group,
            num_nvl_bytes=4 * 1024**3,
            num_rdma_bytes=2 * 1024**3
        )
        self.experts = nn.ModuleList([
            ExpertFFN(hidden_dim) for _ in range(num_experts // ep_group.size())
        ])

    def forward(self, hidden_states, router_logits):
        # Step 1: Router decision (runs on local GPU, no communication)
        topk_weights, topk_idx = router_logits.topk(k=8, dim=-1)
        topk_weights = F.softmax(topk_weights, dim=-1)

        # Step 2: Dispatch -- fire and register async handle
        recv_tokens, metadata, dispatch_handle = self.buffer.dispatch(
            hidden_states, topk_idx, topk_weights,
            num_experts=256
        )

        # Step 3: Expert computation on received tokens
        # This runs WHILE the dispatch for the NEXT layer is being prepared
        expert_outputs = self._run_local_experts(recv_tokens, metadata)

        # Step 4: Combine -- send expert outputs back to originating GPUs
        combined, combine_handle = self.buffer.combine(
            expert_outputs, dispatch_handle
        )

        # Step 5: Wait for combine to finish, then apply gate weights
        combine_handle.wait()
        output = combined * topk_weights.unsqueeze(-1)
        return output.sum(dim=1)  # Sum over top-k experts

The overlap strategy becomes clear when you consider two consecutive MoE layers:

Timeline for layers L and L+1:

Layer L:
  [Router L] -> [Dispatch L (async)] -> [Expert FFN L] -> [Combine L (async)]
                                                                |
Layer L+1:                                                      v
                 [Router L+1] -> [Dispatch L+1 (async)] -> [Expert FFN L+1] -> ...
                                  ^
                                  |
                 Dispatch L+1 communication overlaps with Expert FFN L compute

The dispatch for layer $L+1$ starts as soon as the router decision is made, which depends only on the attention output — not on the MoE output of layer $L$. In a standard transformer block (Attention -> MoE), the router for layer $L+1$ runs during layer $L+1$'s attention phase, which can proceed while layer $L$'s combine is still in-flight. DeepEP's async handles make this pipelining explicit.

📊

DeepEP Measured Throughput (H800 Cluster, 64 GPUs)

Kernel Type                  NVLink BW   RDMA BW   SM Usage   Use Case
High-throughput              153 GB/s    50 GB/s   8-12 SMs   Prefill (large batch)
Low-latency                  120 GB/s    42 GB/s   0 SMs      Decode (small batch)
NCCL all-to-all (baseline)   130 GB/s    35 GB/s   4-8 SMs    General purpose

Note: High-throughput kernels achieve simultaneous NVLink + RDMA saturation. Low-latency kernels sacrifice peak throughput for zero SM consumption.

FP8 Dispatch Quantization

DeepEP supports dispatching hidden states in FP8 instead of FP16/BF16. This halves the communication volume:

V_{\text{dispatch,FP8}} = 4096 \times 8 \times 7168 \times 1 = 234{,}881{,}024 \text{ bytes} \approx 224.5 \text{ MB}

The quantization happens on the sending GPU right before the transfer, and dequantization happens on the receiving GPU right after. The expert FFN computation still runs in BF16/FP16. The quality impact is minimal because the dispatch payload is an intermediate hidden state (not weights or gradients), and the quantization error is absorbed by the expert’s nonlinear activation.

The latency savings are direct: half the bytes, half the transfer time. On RDMA, this cuts an 18 ms dispatch to 9 ms. Over 58 MoE layers, that is roughly 520 ms saved in the forward pass alone.

4. ScMoE: Achieving 100% Communication-Computation Overlap

Even with DeepEP's async overlap, there remains a structural dependency: the MoE output for layer $L$ must be available before layer $L+1$'s computation can use it. The expert FFN computation can overlap with communication, but the final combine result is still on the critical path. ScMoE (Shortcut-connected Mixture of Experts) eliminates this dependency entirely.

The Structural Problem

In a standard MoE transformer block:

y = \text{Attention}(x) + \text{MoE}(\text{Attention}(x))

The output $y$ depends on the MoE result. Layer $L+1$ needs $y$ to compute its attention. Therefore, the all-to-all combine for layer $L$ must complete before layer $L+1$'s attention can begin. No amount of async overlap can hide a dependency on the critical path.

The exposed communication time is:

T_{\text{exposed}} = \max(0, T_{\text{combine}} - T_{\text{overlap\_window}})

where $T_{\text{overlap\_window}}$ is the time available for overlap (typically the local experts' FFN compute time). When $T_{\text{combine}} > T_{\text{overlap\_window}}$ — which happens at small batch sizes in decode or with slow inter-node links — communication lands on the critical path.
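The exposed-time formula in executable form (names illustrative, timings from the surrounding text):

```python
def exposed_comm_ms(t_combine_ms, t_overlap_window_ms):
    """Communication left on the critical path after overlap."""
    return max(0.0, t_combine_ms - t_overlap_window_ms)

# Large prefill batch: expert compute dwarfs the combine -> fully hidden
print(exposed_comm_ms(t_combine_ms=0.5, t_overlap_window_ms=4.0))    # 0.0
# Decode over inter-node RDMA: the combine dominates -> mostly exposed
print(exposed_comm_ms(t_combine_ms=18.0, t_overlap_window_ms=1.2))
```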

The ScMoE Solution: Skip Connection Bypass

ScMoE modifies the transformer block by adding a weighted skip connection that bypasses the MoE layer:

y = \alpha \cdot \text{MoE}(x) + (1 - \alpha) \cdot x

where $\alpha$ is a learned scalar (per-layer, initialized to 0.5). The full block becomes:

h = \text{Attention}(\text{LN}(x)) + x
y = \alpha \cdot \text{MoE}(\text{LN}(h)) + (1 - \alpha) \cdot h

The skip path $(1 - \alpha) \cdot h$ requires no communication — it is a local scalar multiply, available immediately after attention completes. The MoE path $\alpha \cdot \text{MoE}(\text{LN}(h))$ requires the full dispatch/expert-compute/combine pipeline.

The key insight: the next layer's attention can begin using the skip path output $(1 - \alpha) \cdot h$ while the MoE communication for the current layer is still in-flight. When the MoE result eventually arrives, it is added as a correction:

y_{\text{approx}} = (1 - \alpha) \cdot h \quad \text{(available immediately)}
y_{\text{final}} = y_{\text{approx}} + \alpha \cdot \text{MoE}(\text{LN}(h)) \quad \text{(available after combine)}

Layer $L+1$'s attention starts with $y_{\text{approx}}$, and the MoE correction is folded in via a deferred addition.
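The two-phase form is algebraically identical to the one-shot block output; a toy check with a stand-in expert function:

```python
def moe(h):
    """Stand-in for the full dispatch/expert/combine path (any function of h)."""
    return 0.9 * h + 1.0

alpha, h = 0.5, 2.0

# One-shot ScMoE block output
y = alpha * moe(h) + (1 - alpha) * h

# Deferred form: skip path now, MoE correction folded in when the combine lands
y_approx = (1 - alpha) * h             # available immediately after attention
y_final = y_approx + alpha * moe(h)    # applied once the all-to-all completes

assert y_final == y
print(y_final)
```

The approximation cost is therefore purely temporal: the next layer briefly computes on the skip-path value, never on a numerically different final result.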

ScMoE Execution Timeline (Layer L)

Skip path provides immediate output. MoE result arrives as a correction term.

Attention (Layer L) Standard multi-head attention on input x. Produces h. Duration: ~2.5 ms (decode, B=256)
Router + Dispatch (async) Router computes top-K. Dispatch starts all-to-all transfer. Communication begins. ~0.1 ms for router, then async dispatch.
Skip Path: (1-a) * h Scalar multiply. Available IMMEDIATELY. No communication needed. Layer L+1 attention can start NOW using this approximate output.
Expert FFN (remote GPUs) Expert computation happens on remote GPUs. 1.2 ms compute. OVERLAPPED with Layer L+1 attention.
Combine (async) + Correction Expert results arrive. y_final = y_approx + alpha * MoE(h). Correction applied. Layer L complete.

The Overlap Timeline

Without ScMoE, the serial execution is:

Layer L:  [Attn 2.5ms] -> [Router 0.1ms] -> [Dispatch 0.5ms] -> [Expert 1.2ms]
          -> [Combine 0.5ms]
Layer L+1: .............. BLOCKED .......................................... [Attn 2.5ms] -> ...

Total: 4.8ms per layer
Exposed communication: 1.0ms (dispatch + combine)

With ScMoE:

Layer L:  [Attn 2.5ms] -> [Router 0.1ms] -> [Skip path: immediate]
                           |                        |
                           v                        v
                     [Dispatch 0.5ms]         Layer L+1: [Attn 2.5ms]
                           |                        |
                           v                        |
                     [Expert 1.2ms]                 |
                           |                        |
                           v                        v
                     [Combine 0.5ms]          [Correction arrives,
                           |                   added to output]
                           v
                     [Correction applied to L+1's input]

Total: 2.6ms per layer (attention-dominated)
Exposed communication: 0ms (fully hidden behind L+1 attention)

The communication is fully hidden as long as:

T_{\text{dispatch}} + T_{\text{expert}} + T_{\text{combine}} \leq T_{\text{attention},L+1}

For typical decode workloads: $0.5 + 1.2 + 0.5 = 2.2 \text{ ms} \leq 2.5 \text{ ms}$. The condition is met. Communication cost drops to zero on the critical path.
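The hiding condition as a one-line predicate (names illustrative; timings in ms, taken from the text):

```python
def comm_fully_hidden(t_dispatch, t_expert, t_combine, t_attention_next):
    """ScMoE hides all communication iff the MoE pipeline for layer L fits
    inside layer L+1's attention window (all times in ms)."""
    return t_dispatch + t_expert + t_combine <= t_attention_next

print(comm_fully_hidden(0.5, 1.2, 0.5, 2.5))    # True  -- intra-node NVLink
print(comm_fully_hidden(18.0, 1.2, 18.0, 2.5))  # False -- inter-node RDMA
```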

Per-Layer Latency: Standard MoE vs ScMoE (Decode, B=256)

Standard MoE (serial)           4.8 ms   (1.0 ms comm exposed)
Standard MoE + DeepEP overlap   3.6 ms   (~0.3 ms comm exposed)
ScMoE (full overlap)            2.6 ms   (0 ms comm exposed)

Quality Impact

The skip connection introduces a bias: the next layer operates on an approximate input until the MoE correction arrives. The learned $\alpha$ controls this tradeoff. In practice:

  • $\alpha$ converges to 0.3-0.7 depending on layer depth.
  • Early layers (whose experts are less specialized) learn smaller $\alpha$ values, meaning the skip path dominates and MoE acts as a small correction.
  • Later layers learn larger $\alpha$ values, relying more heavily on expert outputs.

The quality degradation is measurable but small. On standard benchmarks:

📊

ScMoE Quality Impact (DeepSeek V3 Configuration, 256 Experts)

Metric                        Standard MoE   ScMoE          Delta
MMLU (5-shot)                 87.1           86.7           -0.4
HumanEval (pass@1)            65.2           64.8           -0.4
GSM8K                         92.3           91.8           -0.5
End-to-end latency (decode)   4.8 ms/layer   2.6 ms/layer   -46%
Throughput (tok/s, 64 GPUs)   12,400         18,100         +46%

Note: Quality loss under 1% across benchmarks. Throughput gain of 46% from eliminating communication from the critical path.

A sub-1% quality drop for 46% throughput improvement. For serving workloads where cost-per-token matters, this is a favorable tradeoff. The skip connection also provides gradient flow benefits during training (similar to ResNet’s original motivation), partially compensating for the approximation.

💡 When ScMoE Does Not Help

ScMoE’s overlap advantage disappears when communication is already fully hidden. On a single NVLink node (0.5 ms dispatch, 0.5 ms combine), standard DeepEP overlap already hides most communication behind expert compute. ScMoE’s value is highest in inter-node EP deployments where RDMA latency dominates: 18 ms dispatch times cannot be hidden behind 1.2 ms of expert FFN compute without architectural changes like skip connections.

5. Grouped GEMM: Efficient Expert Computation Under Load Imbalance

Once tokens arrive at their assigned GPUs after the all-to-all dispatch, each GPU must execute expert FFN computation for all received tokens. The challenge: different experts on the same GPU receive different numbers of tokens, and this number varies across batches.

The Problem

Each expert is an FFN with the same architecture (typically two linear layers with an activation in between). If expert e0e_0 on GPU gg receives 120 tokens and expert e1e_1 receives 45 tokens, executing them sequentially wastes time:

# Naive approach: sequential expert execution
for expert_id, expert in enumerate(local_experts):
    mask = (assigned_expert_ids == expert_id)
    tokens_for_expert = recv_tokens[mask]  # Variable size!
    output[mask] = expert(tokens_for_expert)
# Problem: 4 sequential kernel launches, each with different batch size.
# Small-batch experts underutilize GPU. Launch overhead adds up.

With 4 experts per GPU and variable token counts per expert, this produces 4 sequential GEMM kernel launches. Each launch has ~5-10 us overhead. More critically, small-batch GEMMs (e.g., 45 tokens with hidden dim 7168) cannot saturate the GPU’s tensor cores. The GPU is underutilized during small-expert execution.

Grouped GEMM Solution

Grouped GEMM executes multiple matrix multiplications with different batch dimensions in a single kernel launch. All experts’ computations are fused into one kernel that internally handles the varying sizes:

import torch
# `grouped_gemm` below denotes a CUTLASS-backed grouped GEMM entry point
# (e.g. the open-source `grouped_gemm` package used by MegaBlocks); the
# call signature is sketched for illustration.

# Sort tokens by assigned expert
sorted_indices = torch.argsort(assigned_expert_ids)
sorted_tokens = recv_tokens[sorted_indices]

# Compute offsets for each expert's token range
expert_counts = torch.bincount(assigned_expert_ids, minlength=num_local_experts)
offsets = torch.cumsum(expert_counts, dim=0)
# offsets = [120, 165, 230, 280]  (cumulative token counts)

# All expert weight matrices stacked
# W1: [num_local_experts, hidden_dim, intermediate_dim]
# W2: [num_local_experts, intermediate_dim, hidden_dim]

# Single grouped GEMM call -- one kernel launch for all 4 experts
# Each "group" has a different M dimension (token count)
output = grouped_gemm(
    A=sorted_tokens,           # [total_tokens, hidden_dim]
    B=expert_weights_w1,       # [num_local_experts, hidden_dim, intermediate_dim]
    batch_sizes=expert_counts, # [120, 45, 65, 50]
    trans_b=True
)

The CUTLASS grouped GEMM kernel internally partitions the GPU’s thread blocks across the groups. Larger groups (more tokens) get more thread blocks. The kernel launch overhead is paid once instead of num_local_experts times.

CUTLASS Grouped GEMM API

The CUTLASS library provides the most optimized grouped GEMM implementation for NVIDIA GPUs:

// CUTLASS grouped GEMM setup (schematic -- the production kernel is composed
// via cutlass::gemm::kernel::DefaultGemmGrouped with additional parameters)
using GemmKernel = cutlass::gemm::kernel::GemmGrouped<
    cutlass::half_t,                    // Element type A (tokens)
    cutlass::layout::RowMajor,          // Layout A
    cutlass::half_t,                    // Element type B (weights)
    cutlass::layout::ColumnMajor,       // Layout B
    cutlass::half_t,                    // Element type C (output)
    cutlass::layout::RowMajor,          // Layout C
    float,                              // Accumulator type
    cutlass::arch::OpClassTensorOp,     // Use tensor cores
    cutlass::arch::Sm90,                // H100 architecture
    cutlass::gemm::GemmShape<128,128,64>, // Tile shape per CTA
    cutlass::gemm::GemmShape<64,64,64>,   // Warp shape
    cutlass::gemm::GemmShape<16,8,16>     // MMA instruction shape
>;

// Problem sizes -- one per expert
std::vector<cutlass::gemm::GemmCoord> problem_sizes = {
    {120, 4096, 7168},  // Expert 0: 120 tokens, intermediate=4096, hidden=7168
    {45,  4096, 7168},  // Expert 1: 45 tokens
    {65,  4096, 7168},  // Expert 2: 65 tokens
    {50,  4096, 7168},  // Expert 3: 50 tokens
};

// Launch single grouped kernel
GemmKernel::Arguments args(
    problem_sizes.data(),
    problem_sizes.size(),
    ptr_A, lda_A,            // Token matrices per group
    ptr_B, lda_B,            // Weight matrices per group
    ptr_C, lda_C,            // Output matrices per group
    ptr_D, lda_D,            // Bias (optional)
    {1.0f, 0.0f}             // alpha, beta
);
GemmKernel grouped_gemm;
grouped_gemm.initialize(args);
grouped_gemm.run(stream);

Performance Comparison

📊

Expert FFN Execution: Sequential vs Grouped GEMM

ApproachKernel LaunchesLatency (4 experts)GPU Util
Sequential GEMM 8 (2 per expert) 1.85 ms 62%
Grouped GEMM (CUTLASS) 2 (W1 + W2) 1.20 ms 89%
Grouped GEMM + FP8 2 0.72 ms 91%
Note: Measured on H100 SXM5, 4 local experts, token counts [120, 45, 65, 50], hidden_dim=7168, intermediate=4096.

The grouped GEMM achieves 35% lower latency than sequential execution by eliminating kernel launch overhead and improving occupancy. The GPU utilization jumps from 62% to 89% because larger effective batch sizes better saturate the tensor cores. Adding FP8 computation (with FP32 accumulation) further cuts latency to 0.72 ms — a 61% reduction from the sequential baseline.

⚠️ Load Imbalance Limits Grouped GEMM

Grouped GEMM improves efficiency but cannot eliminate the fundamental load imbalance problem. If one expert receives 300 tokens and another receives 10, the 300-token group dominates the kernel’s runtime. The 10-token group finishes in a fraction of the time but must wait for the 300-token group. This is why router load balancing (auxiliary loss, capacity factors) directly impacts expert compute efficiency — not just communication balance.

6. Implementer’s Exercise: EP-Aware MoE Forward Pass

Putting it all together: here is a complete pseudocode implementation of an EP-aware MoE forward pass that integrates routing, all-to-all dispatch/combine, and grouped expert execution. This is the code path that runs once per MoE layer, 58 times per forward pass.

import torch
import torch.distributed as dist
from torch import Tensor

class EPMoELayer(torch.nn.Module):
    """
    Expert-parallel MoE layer.

    Assumptions:
    - ep_group: process group spanning all EP ranks (e.g., 64 GPUs)
    - Each GPU hosts (num_experts // ep_size) local experts
    - Router produces top-k expert selections per token
    """

    def __init__(
        self,
        hidden_dim: int = 7168,
        intermediate_dim: int = 4096,
        num_experts: int = 256,
        top_k: int = 8,
        ep_group: dist.ProcessGroup = None,
    ):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_experts = num_experts
        self.top_k = top_k
        self.ep_group = ep_group
        self.ep_size = dist.get_world_size(ep_group)  # 64
        self.ep_rank = dist.get_rank(ep_group)
        self.experts_per_gpu = num_experts // self.ep_size  # 4

        # Router: linear projection to expert logits
        self.router = torch.nn.Linear(hidden_dim, num_experts, bias=False)

        # Local expert FFN weights (4 experts per GPU)
        self.w1 = torch.nn.Parameter(
            torch.randn(self.experts_per_gpu, hidden_dim, intermediate_dim)
        )
        self.w2 = torch.nn.Parameter(
            torch.randn(self.experts_per_gpu, intermediate_dim, hidden_dim)
        )

    def forward(self, x: Tensor) -> Tensor:
        """
        Args:
            x: [num_tokens, hidden_dim] -- tokens on this GPU

        Returns:
            output: [num_tokens, hidden_dim] -- MoE output
        """
        num_tokens = x.shape[0]

        # === Step 1: Router Decision (local, no communication) ===
        router_logits = self.router(x)          # [num_tokens, 256]
        topk_weights, topk_ids = torch.topk(
            router_logits, self.top_k, dim=-1
        )                                        # [num_tokens, 8]
        topk_weights = torch.softmax(topk_weights, dim=-1)

        # === Step 2: Prepare dispatch metadata ===
        # For each token-expert pair, compute destination GPU
        dest_gpus = topk_ids // self.experts_per_gpu  # [num_tokens, 8]
        local_expert_ids = topk_ids % self.experts_per_gpu

        # Count tokens going to each GPU (for all-to-all sizing)
        send_counts = torch.bincount(
            dest_gpus.reshape(-1), minlength=self.ep_size
        )

        # Exchange counts so each GPU knows incoming volume
        recv_counts = torch.zeros_like(send_counts)
        dist.all_to_all_single(
            recv_counts, send_counts, group=self.ep_group
        )
        # recv_counts[i] = number of token-expert pairs arriving from GPU i

        # === Step 3: All-to-All Dispatch ===
        # Sort tokens by destination GPU for contiguous sends
        flat_dest = dest_gpus.reshape(-1)         # [num_tokens * 8]
        flat_tokens = x.repeat_interleave(self.top_k, dim=0)  # replicate
        sort_idx = torch.argsort(flat_dest)
        sorted_tokens = flat_tokens[sort_idx]     # contiguous per dest

        # Compute send/recv splits (cumulative token counts per GPU)
        send_splits = send_counts.tolist()
        recv_splits = recv_counts.tolist()

        # All-to-all: dispatch hidden states to expert GPUs
        total_recv = sum(recv_splits)
        recv_buffer = torch.empty(
            total_recv, self.hidden_dim,
            dtype=x.dtype, device=x.device
        )

        # The core all-to-all operation
        # Each GPU sends send_splits[i] tokens to GPU i
        # Each GPU receives recv_splits[i] tokens from GPU i
        dist.all_to_all(
            list(recv_buffer.split(recv_splits)),
            list(sorted_tokens.split(send_splits)),
            group=self.ep_group
        )
        # recv_buffer: [total_recv, hidden_dim] -- tokens for local experts

        # === Step 4: Grouped Expert FFN Computation ===
        # Sort received tokens by local expert ID for grouped GEMM
        # (metadata about which local expert each token targets
        #  is exchanged alongside the hidden states)
        recv_expert_ids = self._get_recv_expert_ids(recv_splits)
        expert_sort = torch.argsort(recv_expert_ids)
        sorted_recv = recv_buffer[expert_sort]

        expert_counts = torch.bincount(
            recv_expert_ids, minlength=self.experts_per_gpu
        )

        # Grouped GEMM: W1 (up-projection) for all experts in one launch
        # sorted_recv: [total_recv, hidden_dim]
        # w1: [experts_per_gpu, hidden_dim, intermediate_dim]
        intermediate = self._grouped_gemm(
            sorted_recv, self.w1, expert_counts
        )
        intermediate = torch.nn.functional.silu(intermediate)

        # Grouped GEMM: W2 (down-projection)
        expert_output = self._grouped_gemm(
            intermediate, self.w2, expert_counts
        )

        # Unsort back to per-source-GPU ordering
        unsort_idx = torch.argsort(expert_sort)
        expert_output = expert_output[unsort_idx]

        # === Step 5: All-to-All Combine ===
        # Send expert outputs back to originating GPUs
        send_buffer = torch.empty(
            num_tokens * self.top_k, self.hidden_dim,
            dtype=x.dtype, device=x.device
        )
        dist.all_to_all(
            list(send_buffer.split(send_splits)),  # recv into original order
            list(expert_output.split(recv_splits)), # send from expert order
            group=self.ep_group
        )

        # === Step 6: Unsort and apply gate weights ===
        unsorted_output = torch.empty_like(send_buffer)
        unsorted_output[sort_idx] = send_buffer
        # Reshape to [num_tokens, top_k, hidden_dim]
        per_expert_out = unsorted_output.view(
            num_tokens, self.top_k, self.hidden_dim
        )
        # Weighted sum: gate weights * expert outputs
        output = (per_expert_out * topk_weights.unsqueeze(-1)).sum(dim=1)

        return output  # [num_tokens, hidden_dim]

    def _grouped_gemm(
        self, tokens: Tensor, weights: Tensor, counts: Tensor
    ) -> Tensor:
        """
        Execute grouped GEMM for all local experts.

        In production, this calls CUTLASS grouped GEMM kernel.
        Here shown as a loop for clarity.
        """
        outputs = []
        offset = 0
        for i, count in enumerate(counts.tolist()):
            if count > 0:
                expert_tokens = tokens[offset:offset + count]
                out = expert_tokens @ weights[i]
                outputs.append(out)
            offset += count
        return torch.cat(outputs, dim=0) if outputs else tokens.new_empty(0, weights.shape[-1])

    def _get_recv_expert_ids(self, recv_splits):
        """Reconstruct local expert IDs for received tokens."""
        # In practice, this metadata is piggybacked on the all-to-all
        # Simplified: assume uniform distribution for illustration
        ids = []
        for split_size in recv_splits:
            ids.append(
                torch.randint(0, self.experts_per_gpu, (split_size,))
            )
        return torch.cat(ids).to(self.w1.device)

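The trickiest bookkeeping above is the sort/unsort round trip (Steps 3 and 6): token copies are grouped by destination GPU before dispatch, and the combine output must be scattered back into the original routing order. A minimal toy sketch of that invariant (hypothetical sizes, no communication):

```python
import torch

# Toy routing: 3 token copies headed to destination GPUs 2, 0, 1
flat_dest = torch.tensor([2, 0, 1])
flat_tokens = torch.arange(3.0).unsqueeze(1)   # stand-in hidden states

sort_idx = torch.argsort(flat_dest)            # group by destination GPU
sorted_tokens = flat_tokens[sort_idx]          # contiguous per destination

# ... dispatch, expert FFN, and combine would happen here; combine returns
# results in the same destination-sorted order ...

# Step-6-style scatter: write each row back into its original slot
restored = torch.empty_like(sorted_tokens)
restored[sort_idx] = sorted_tokens
assert torch.equal(restored, flat_tokens)      # original order recovered
```

This is exactly the `unsorted_output[sort_idx] = send_buffer` pattern in Step 6.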
Profiling the Critical Path

When you profile this code with torch.profiler, the critical path becomes visible:

EP MoE Layer Critical Path Breakdown

Sequential execution timeline for one MoE layer. Communication dominates.

| Stage | What happens | Cost |
| --- | --- | --- |
| Router (linear + top-K + softmax) | 4096 × 256 matmul + top-8 selection. Purely local. | ~0.08 ms (H100) |
| Dispatch metadata exchange (all-to-all counts) | 64 × int64 values exchanged. Tiny payload, high relative latency. | ~0.05 ms (latency-dominated) |
| Token sort + all-to-all dispatch | Sort by destination GPU: ~0.02 ms. All-to-all: 449 MB across 64 GPUs. | ~0.5 ms NVLink / ~18 ms RDMA |
| Expert FFN (grouped GEMM ×2 + SiLU) | Two grouped GEMMs for up- and down-projection, plus activation. | ~1.2 ms (H100, 4 local experts, ~70 tokens/expert avg) |
| All-to-all combine | Send expert outputs back. Same volume as dispatch. | ~0.5 ms NVLink / ~18 ms RDMA |
| Unsort + weighted sum | Restore original token order, apply gate weights, sum over top-K. | ~0.05 ms |

Total per-layer cost:

- Intra-node only (NVLink): 0.08 + 0.05 + 0.52 + 1.20 + 0.52 + 0.05 = 2.42 ms
- Inter-node (RDMA): 0.08 + 0.05 + 18.02 + 1.20 + 18.02 + 0.05 = 37.42 ms

The inter-node case is 15.5x slower, and 96% of that time is communication. This is precisely why DeepEP’s overlap and ScMoE’s skip connections exist — the expert compute is only 1.2 ms out of a 37.4 ms layer.
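The totals fall out of simple arithmetic over the stage estimates in the breakdown above:

```python
# Per-stage costs in ms: router, count exchange, sort + dispatch,
# expert FFN, combine, unsort + weighted sum
nvlink = [0.08, 0.05, 0.52, 1.20, 0.52, 0.05]
rdma   = [0.08, 0.05, 18.02, 1.20, 18.02, 0.05]

t_nvlink = sum(nvlink)            # 2.42 ms per layer
t_rdma = sum(rdma)                # 37.42 ms per layer
slowdown = t_rdma / t_nvlink      # ~15.5x

# Communication share of the RDMA layer: counts + dispatch + combine
comm_share = (0.05 + 18.02 + 18.02) / t_rdma   # ~0.96

assert round(t_nvlink, 2) == 2.42
assert round(t_rdma, 2) == 37.42
assert round(slowdown, 1) == 15.5
assert round(comm_share, 2) == 0.96
```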

Making It Async

Converting the synchronous implementation above to use DeepEP’s async API requires minimal changes:

# Replace synchronous dist.all_to_all with DeepEP async dispatch
recv_buffer, metadata, handle = self.buffer.dispatch(
    x, topk_ids, topk_weights,
    num_experts=256,
    async_op=True  # Non-blocking
)

# Expert FFN from PREVIOUS layer runs here (if pipelined)
prev_layer_expert_output = self.prev_layer_experts(prev_recv_buffer)

# Wait for current dispatch to complete
handle.wait()

# Now run current layer's expert FFN
expert_output = self.run_local_experts(recv_buffer, metadata)

# Async combine
combined, combine_handle = self.buffer.combine(
    expert_output, metadata,
    async_op=True  # Non-blocking
)

# Next layer's attention can start here (if using ScMoE skip)

The pattern is: fire async dispatch, do useful work (previous layer’s expert FFN or next layer’s attention via ScMoE), then wait for the result. The goal is to push $T_{\text{exposed}} \rightarrow 0$.
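The fire/compute/wait shape is independent of DeepEP itself. As a toy illustration (pure Python; a `ThreadPoolExecutor` future stands in for the DeepEP handle, and `time.sleep` models transfer and compute time), overlap drives total wall time toward max(comm, compute) rather than their sum:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def dispatch(tokens):
    # Stand-in for the async all-to-all: the sleep models ~50 ms of transfer
    time.sleep(0.05)
    return [t * 2 for t in tokens]   # pretend: tokens delivered to expert GPUs

pool = ThreadPoolExecutor(max_workers=1)

start = time.perf_counter()
handle = pool.submit(dispatch, [1, 2, 3])   # fire async dispatch

time.sleep(0.05)   # overlapped useful work (e.g. previous layer's expert FFN)

recv = handle.result()                      # "handle.wait()": block on delivery
elapsed = time.perf_counter() - start

# With full overlap, elapsed ~ max(0.05, 0.05) = ~50 ms, not ~100 ms
assert recv == [2, 4, 6]
assert elapsed < 0.09
```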

End-to-End Forward Pass Latency (61 MoE Layers, 64 GPUs)

| Configuration | Basis | Total | vs. sync NVLink |
| --- | --- | --- | --- |
| Sync (NVLink only) | 2.42 ms × 61 | 148 ms | baseline |
| Sync (RDMA) | 37.42 ms × 61 | 2,283 ms | +1442.6% |
| DeepEP overlap (NVLink) | ~22% saved | 115 ms | — |
| DeepEP overlap (RDMA) | ~48% saved | 1,195 ms | +707.4% |
| ScMoE + DeepEP (RDMA) | ~90% saved | 220 ms | +48.6% |
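The sync figures are direct multiplications of the per-layer costs; a quick check of the arithmetic (the percentage is computed from the rounded totals):

```python
layers = 61
sync_nvlink = 2.42 * layers    # per-layer NVLink cost x 61 MoE layers
sync_rdma = 37.42 * layers     # per-layer RDMA cost x 61 MoE layers

assert round(sync_nvlink) == 148     # ms
assert round(sync_rdma) == 2283      # ms

# "+1442.6%" is the RDMA total relative to the 148 ms NVLink baseline
pct = (round(sync_rdma) / round(sync_nvlink) - 1) * 100
assert round(pct, 1) == 1442.6
```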
The Optimization Stack

The full optimization stack for EP communication is layered: (1) FP8 dispatch quantization halves the byte volume. (2) Hierarchical all-to-all separates fast NVLink transfers from slow RDMA transfers. (3) DeepEP async kernels overlap communication with expert computation. (4) ScMoE skip connections overlap communication with the next layer’s attention. Each technique compounds with the others. Applied together, they reduce exposed communication time by 90%+ even in inter-node deployments.
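A back-of-envelope sketch of how steps (1), (3), and (4) compound on the inter-node path; the bandwidth and savings figures are this article's estimates, not measurements, and the hierarchical all-to-all of step (2) is not modeled:

```python
rdma_bw = 25e9                           # bytes/s, InfiniBand RDMA
dispatch_bytes = 4096 * 8 * 7168 * 2     # BF16 dispatch payload per layer

t_bf16 = dispatch_bytes / rdma_bw * 1e3  # ms per all-to-all over RDMA
t_fp8 = t_bf16 / 2                       # (1) FP8 halves the byte volume
t_exposed = t_fp8 * (1 - 0.90)           # (3)+(4) ~90% hidden by overlap

print(round(t_bf16, 1), round(t_fp8, 1), round(t_exposed, 2))
# 18.8 9.4 0.94 -- exposed comm drops from ~19 ms to under 1 ms per layer
```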

Summary

Expert parallelism is the only viable distribution strategy for models with hundreds of experts, but its communication cost is severe. The all-to-all dispatch and combine operations move 898 MB per MoE layer. Over 61 layers, that is 53.5 GB per forward pass. The 36x bandwidth gap between NVLink (900 GB/s) and InfiniBand RDMA (25 GB/s) means that expert placement and hierarchical communication are not optional — they are load-bearing design decisions.

DeepEP provides the systems infrastructure: high-throughput kernels that saturate NVLink and RDMA simultaneously (roughly 153 GB/s over NVLink plus 50 GB/s over RDMA), and low-latency kernels that execute RDMA transfers without consuming any GPU SMs. Its hook-based API enables pipelining dispatch and combine operations with expert computation.

ScMoE goes further by eliminating the structural dependency between MoE output and the next layer’s computation. The skip connection provides an immediate approximate output, allowing the next layer’s attention to proceed while expert results are still in transit. The result is 100% communication-computation overlap at a sub-1% quality cost.

Grouped GEMM handles the final piece: efficiently executing variable-batch expert FFNs in a single kernel launch, improving GPU utilization from 62% to 89%.

The implementer’s exercise showed the complete data flow: router decision, all-to-all dispatch, sorted expert computation, all-to-all combine, and weighted aggregation. The critical path analysis makes it clear why communication optimization dominates MoE system design — expert computation is 1.2 ms, but inter-node communication is 36 ms. Without overlap, 96% of time is spent moving bytes, not computing on them.