Training an 8x7B MoE like Mixtral from scratch takes on the order of 1T tokens; sparse upcycling gets most of the way there with roughly 100B tokens of continued training, a saving of about 90%. The trick: instead of initializing expert weights randomly, duplicate the pretrained dense MLP 8 times and add a router network. The experts start identical but diverge during continued training as the router learns specialization. The result: upcycled models reach roughly 85% of native MoE quality at about 10% of the training cost. The gap narrows with more upcycling FLOPs, but even at equal training budgets, native MoE still wins by a few points on most benchmarks.
This post covers the complete sparse upcycling pipeline: weight initialization strategies, router training, load balancing, and the tradeoffs between upcycled and natively-trained MoE models.
## Dense-to-MoE Conversion

### Architecture Transformation
```python
import numpy as np
from dataclasses import dataclass


@dataclass
class UpcyclingConfig:
    """Configuration for sparse upcycling."""
    n_experts: int = 8
    top_k: int = 2
    router_type: str = "learned"  # "learned" or "hash"
    init_strategy: str = "duplicate_noise"
    noise_std: float = 0.01
    load_balance_loss_weight: float = 0.01
    router_z_loss_weight: float = 0.001
    continued_training_tokens: int = 50_000_000_000
    learning_rate: float = 2e-5


class SparseUpcycler:
    """
    Convert a pretrained dense transformer to MoE.

    Transformation steps:
    1. For each transformer layer's MLP:
       a. Duplicate the MLP weights N times (N = n_experts)
       b. Add noise to break symmetry
       c. Add a router network (linear layer)
    2. Keep attention layers unchanged
    3. Keep embedding and output layers unchanged
    4. Continue training with the MoE forward pass
    """

    def __init__(self, config):
        self.config = config

    def convert_layer(self, dense_mlp_weights):
        """
        Convert a single dense MLP to an expert-routed MoE layer.

        dense_mlp_weights: dict with keys
            'gate_proj': [hidden_dim, intermediate_dim]
            'up_proj':   [hidden_dim, intermediate_dim]
            'down_proj': [intermediate_dim, hidden_dim]
        """
        n_experts = self.config.n_experts
        hidden_dim = dense_mlp_weights["gate_proj"].shape[0]

        # Initialize experts from the dense weights
        expert_weights = [
            self._init_expert(dense_mlp_weights, i)
            for i in range(n_experts)
        ]

        # Initialize router
        router_weights = self._init_router(hidden_dim)

        return {
            "experts": expert_weights,
            "router": router_weights,
            "n_experts": n_experts,
            "top_k": self.config.top_k,
        }

    def _init_expert(self, dense_weights, expert_idx):
        """
        Initialize expert weights from the dense MLP.

        Strategies:
        1. duplicate_noise: copy dense weights + small noise
           (most common, works well in practice)
        2. subset_init: each expert keeps a different slice
           of the intermediate neurons
        3. random_permute: re-pair intermediate neurons
           differently for each expert
        """
        strategy = self.config.init_strategy
        if strategy == "subset_init":
            return self._subset_init(dense_weights, expert_idx)
        if strategy == "random_permute":
            return self._permute_init(dense_weights, expert_idx)
        # "duplicate_noise" (and any unknown strategy) falls through here
        return self._duplicate_noise_init(dense_weights, expert_idx)

    def _duplicate_noise_init(self, dense_weights, expert_idx):
        """
        Copy dense weights and add Gaussian noise.

        The noise breaks symmetry so experts can specialize
        during continued training. Without noise, all experts
        would remain identical and the router could not learn
        to differentiate them.
        """
        expert = {}
        for key, weight in dense_weights.items():
            # Scale noise relative to the weight's own std, so
            # noise_std=0.01 means a ~1% relative perturbation
            noise = np.random.randn(*weight.shape)
            noise *= self.config.noise_std * np.std(weight)
            expert[key] = weight + noise
        return expert

    def _subset_init(self, dense_weights, expert_idx):
        """
        Each expert keeps a subset of intermediate neurons.

        For N experts, expert i keeps neurons
        [i*d/N : (i+1)*d/N] of the intermediate dimension; the
        rest are zero-initialized. This forces immediate
        specialization but starts with lower capacity per expert.
        """
        expert = {}
        n_experts = self.config.n_experts
        for key, weight in dense_weights.items():
            new_weight = np.zeros_like(weight)
            if key in ("gate_proj", "up_proj"):
                # Intermediate dimension is axis 1
                dim = weight.shape[1]
                start = expert_idx * dim // n_experts
                end = (expert_idx + 1) * dim // n_experts
                new_weight[:, start:end] = weight[:, start:end]
            elif key == "down_proj":
                # Intermediate dimension is axis 0
                dim = weight.shape[0]
                start = expert_idx * dim // n_experts
                end = (expert_idx + 1) * dim // n_experts
                new_weight[start:end, :] = weight[start:end, :]
            else:
                new_weight = weight.copy()
            expert[key] = new_weight
        return expert

    def _permute_init(self, dense_weights, expert_idx):
        """
        Randomly permute intermediate neurons for each expert.

        gate_proj and up_proj share one per-expert permutation
        (keeping their elementwise pairing intact) while down_proj
        is left unchanged, so intermediate features are rewired to
        different down-projection rows. Weight magnitudes are
        preserved, but each expert computes a different function.
        """
        expert = {}
        rng = np.random.default_rng(expert_idx)
        inter_dim = dense_weights["gate_proj"].shape[1]
        perm = rng.permutation(inter_dim)
        for key, weight in dense_weights.items():
            if key in ("gate_proj", "up_proj"):
                expert[key] = weight[:, perm]
            else:
                expert[key] = weight.copy()
        return expert

    def _init_router(self, hidden_dim):
        """
        Initialize the router network.

        The router is a linear layer: hidden_dim -> n_experts.
        Its output is passed through softmax to get routing
        probabilities, and the top-k experts are selected.
        Initialization: small zero-centered random weights so
        initial routing is approximately uniform.
        """
        n_experts = self.config.n_experts
        router_weight = np.random.randn(hidden_dim, n_experts) * 0.01
        return {
            "weight": router_weight,
            "bias": np.zeros(n_experts),
        }
```
The noise magnitude in duplicate_noise initialization is critical. Too little noise (std = 0.001) means experts stay identical for many training steps, wasting compute. Too much noise (std = 0.1) destroys the pretrained representations and negates the benefit of upcycling. Empirically, noise std of 0.01 (1% of weight std) produces the best balance between symmetry breaking and knowledge preservation.
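To see what those noise settings mean in weight space, here is a small standalone sketch (plain numpy, not tied to the classes above; the weight matrix is a random stand-in for a pretrained projection) measuring the relative perturbation and cosine similarity each noise std produces:

```python
import numpy as np

def duplicate_with_noise(weight, noise_std, rng):
    """Copy a dense weight and add noise scaled to the weight's own std."""
    noise = rng.standard_normal(weight.shape) * noise_std * np.std(weight)
    return weight + noise

rng = np.random.default_rng(0)
dense = rng.standard_normal((1024, 4096))  # stand-in for a pretrained gate_proj

for noise_std in (0.001, 0.01, 0.1):
    expert = duplicate_with_noise(dense, noise_std, rng)
    # Relative change in Frobenius norm is ~= noise_std by construction
    rel = np.linalg.norm(expert - dense) / np.linalg.norm(dense)
    cos = np.sum(expert * dense) / (
        np.linalg.norm(expert) * np.linalg.norm(dense)
    )
    print(f"std={noise_std}: relative change {rel:.4f}, cosine sim {cos:.6f}")
```

The relative change tracks the configured std almost exactly, which is why the 0.001/0.01/0.1 settings behave so differently: they are 0.1%, 1%, and 10% perturbations of the pretrained weights.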
## Router Training

### Learning to Route During Continued Training
```python
class MoERouter:
    """
    Router network for the MoE forward pass.

    The router takes the hidden state h and produces routing
    probabilities for each expert:

        logits   = h @ W_router + b_router
        probs    = softmax(logits)
        selected = top_k(probs, k)

    Training challenges:
    1. Load balancing: without constraints, the router
       routes all tokens to 1-2 experts
    2. Expert collapse: some experts get no tokens
       and their weights stop updating
    3. Token dropping: when experts are full, tokens
       are dropped (losing information)
    """

    def __init__(self, hidden_dim, n_experts, top_k=2):
        self.hidden_dim = hidden_dim
        self.n_experts = n_experts
        self.top_k = top_k
        self.weight = np.random.randn(hidden_dim, n_experts) * 0.01
        self.bias = np.zeros(n_experts)

    def route(self, hidden_states):
        """
        Route tokens to experts.

        hidden_states: [batch_size, seq_len, hidden_dim]
        Returns: routing weights and expert assignments
        """
        flat = hidden_states.reshape(-1, self.hidden_dim)

        # Routing logits and probabilities
        logits = flat @ self.weight + self.bias
        probs = self._softmax(logits)

        # Top-k selection (argsort ascending, take the last k)
        top_k_indices = np.argsort(probs, axis=-1)[:, -self.top_k:]

        # Keep only the top-k probabilities, renormalized to sum to 1
        top_k_weights = np.zeros_like(probs)
        rows = np.arange(len(flat))[:, None]
        top_k_weights[rows, top_k_indices] = probs[rows, top_k_indices]
        row_sums = top_k_weights.sum(axis=-1, keepdims=True)
        top_k_weights = np.divide(
            top_k_weights, row_sums,
            out=np.zeros_like(top_k_weights), where=row_sums > 0,
        )

        return {
            "indices": top_k_indices,
            "weights": top_k_weights,
            "logits": logits,
            "probs": probs,
        }

    def compute_load_balance_loss(self, routing_result):
        """
        Load balancing auxiliary loss (Switch Transformer).

            L_balance = N * sum_i(f_i * P_i)

        where f_i = fraction of routed slots assigned to expert i
              P_i = average routing probability for expert i

        This loss encourages a uniform token distribution
        across experts.
        """
        probs = routing_result["probs"]
        indices = routing_result["indices"]
        n_tokens = len(probs)

        # f_i: fraction of routed slots per expert
        expert_counts = np.bincount(
            indices.ravel(), minlength=self.n_experts
        )
        f = expert_counts / (n_tokens * self.top_k)

        # P_i: average routing probability per expert
        P = np.mean(probs, axis=0)

        return self.n_experts * float(np.sum(f * P))

    def compute_router_z_loss(self, routing_result):
        """
        Router z-loss (ST-MoE).

            L_z = (1/N) * sum_i log(sum_j exp(logits_ij))^2

        Penalizes large logit magnitudes, preventing the router
        from becoming overconfident and routing all tokens to a
        single expert.
        """
        logits = routing_result["logits"]
        # Numerically stable log-sum-exp
        m = np.max(logits, axis=-1)
        lse = m + np.log(np.sum(np.exp(logits - m[:, None]), axis=-1))
        return float(np.mean(lse ** 2))

    def _softmax(self, logits):
        """Compute a numerically stable softmax."""
        exp = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
        return exp / exp.sum(axis=-1, keepdims=True)
```
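As a sanity check on the balance loss, the standalone sketch below confirms the two textbook limits for N = 8, top-2 routing: a perfectly uniform router scores exactly 1, while a router whose probability mass collapses onto a single expert scores N/2 = 4 (with top-2 forced to pick a zero-probability runner-up). The routing assignments are constructed by hand purely for the check:

```python
import numpy as np

def balance_loss(probs, indices, n_experts, top_k):
    """Switch-style load balance loss: N * sum_i(f_i * P_i)."""
    n_tokens = len(probs)
    counts = np.bincount(indices.ravel(), minlength=n_experts)
    f = counts / (n_tokens * top_k)  # fraction of routed slots per expert
    P = probs.mean(axis=0)           # mean routing probability per expert
    return n_experts * float(np.sum(f * P))

n_experts, top_k, n_tokens = 8, 2, 1000

# Case 1: perfectly uniform probabilities and round-robin assignments
probs_uniform = np.full((n_tokens, n_experts), 1.0 / n_experts)
idx_uniform = np.stack([
    np.arange(n_tokens) % n_experts,
    (np.arange(n_tokens) + 1) % n_experts,
], axis=1)
loss_uniform = balance_loss(probs_uniform, idx_uniform, n_experts, top_k)
print(loss_uniform)  # -> 1.0

# Case 2: all probability mass on expert 0; top-2 still picks a runner-up
probs_collapsed = np.zeros((n_tokens, n_experts))
probs_collapsed[:, 0] = 1.0
idx_collapsed = np.tile([0, 1], (n_tokens, 1))
loss_collapsed = balance_loss(probs_collapsed, idx_collapsed, n_experts, top_k)
print(loss_collapsed)  # -> 4.0
```

Minimizing this loss therefore pushes the router back toward the uniform value of 1.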
### Expert Initialization Strategy Comparison
| Strategy | Tokens to Recover Dense Perf | Final MoE Quality (vs native) | Expert Specialization Speed | Risk of Expert Collapse |
|---|---|---|---|---|
| Duplicate + noise (0.01) | 10B tokens | 97% | Medium | Low |
| Duplicate + noise (0.001) | 25B tokens | 96% | Slow | Medium |
| Subset init | 5B tokens | 94% | Immediate | High (dead experts) |
| Random permute | 15B tokens | 95% | Medium | Low |
| Random init (no upcycling) | 200B+ tokens | 100% | N/A | High initially |
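To make the knowledge-preservation differences in the table concrete, here is a standalone sketch with toy dimensions and random stand-in weights (a SwiGLU MLP is assumed, matching the gate/up/down layout above) comparing how far each initialization's expert output drifts from the original dense MLP on the same input:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w):
    """SwiGLU MLP: down( silu(x @ gate) * (x @ up) )."""
    return (silu(x @ w["gate_proj"]) * (x @ w["up_proj"])) @ w["down_proj"]

rng = np.random.default_rng(0)
hidden, inter, n_experts = 64, 256, 8
dense = {
    "gate_proj": rng.standard_normal((hidden, inter)) * hidden ** -0.5,
    "up_proj": rng.standard_normal((hidden, inter)) * hidden ** -0.5,
    "down_proj": rng.standard_normal((inter, hidden)) * inter ** -0.5,
}
x = rng.standard_normal((32, hidden))
ref = swiglu_mlp(x, dense)

def drift(w):
    """Relative output change vs the dense MLP."""
    out = swiglu_mlp(x, w)
    return np.linalg.norm(out - ref) / np.linalg.norm(ref)

# duplicate + 1% noise: the output barely moves
noisy = {k: v + rng.standard_normal(v.shape) * 0.01 * v.std()
         for k, v in dense.items()}

# subset init (expert 0 of 8): keeps only 1/8 of intermediate neurons
k = inter // n_experts
sub = {key: np.zeros_like(v) for key, v in dense.items()}
sub["gate_proj"][:, :k] = dense["gate_proj"][:, :k]
sub["up_proj"][:, :k] = dense["up_proj"][:, :k]
sub["down_proj"][:k, :] = dense["down_proj"][:k, :]

d_noisy = drift(noisy)
d_sub = drift(sub)
print(f"duplicate+noise drift: {d_noisy:.3f}")
print(f"subset-init drift:     {d_sub:.3f}")
```

The noisy copy stays within a few percent of the dense output, while the subset expert loses most of it, which is why subset init recovers dense perplexity faster per token of specialization but risks dead experts.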
## Continued Training

### The Upcycling Training Loop
```python
class UpcyclingTrainer:
    """
    Continue training after dense-to-MoE conversion.

    Training schedule:
    1. Phase 1 (warmup): low LR, high load-balance loss
       - Router learns to distribute tokens
       - Experts begin to diverge from their copies
    2. Phase 2 (specialization): normal LR, moderate balance loss
       - Experts specialize on different token types
       - Quality surpasses the original dense model
    3. Phase 3 (refinement): decayed LR, low balance loss
       - Fine-tune expert specializations
       - Quality approaches native MoE

    Key insight: the first 1-5B tokens primarily train the
    router. The experts already have good representations from
    dense pretraining; the router must learn which expert to
    use for which tokens.
    """

    def __init__(self, model, config):
        self.model = model
        self.config = config
        self.current_step = 0

    def get_training_schedule(self, total_tokens):
        """Generate the three-phase training schedule."""
        return [
            {
                "phase": "warmup",
                "tokens": int(total_tokens * 0.05),
                "lr": self.config.learning_rate * 0.1,
                "balance_weight": self.config.load_balance_loss_weight * 5,
                "z_loss_weight": self.config.router_z_loss_weight * 2,
            },
            {
                "phase": "specialization",
                "tokens": int(total_tokens * 0.70),
                "lr": self.config.learning_rate,
                "balance_weight": self.config.load_balance_loss_weight,
                "z_loss_weight": self.config.router_z_loss_weight,
            },
            {
                "phase": "refinement",
                "tokens": int(total_tokens * 0.25),
                "lr": self.config.learning_rate * 0.1,
                "balance_weight": self.config.load_balance_loss_weight * 0.5,
                "z_loss_weight": self.config.router_z_loss_weight * 0.5,
            },
        ]

    def compute_total_loss(self, batch, routers, routing_results):
        """
        Compute the total loss including auxiliary losses.

            L_total = L_language + alpha * L_balance + beta * L_z

        where L_language is the standard cross-entropy language
        modeling loss, L_balance the load balancing loss, and
        L_z the router z-loss.

        routers: one MoERouter per MoE layer
        routing_results: the matching dicts returned by route()
        """
        # Language modeling loss (placeholder)
        lm_loss = 0.0

        # Auxiliary losses, averaged over MoE layers
        balance_loss = np.mean([
            router.compute_load_balance_loss(result)
            for router, result in zip(routers, routing_results)
        ])
        z_loss = np.mean([
            router.compute_router_z_loss(result)
            for router, result in zip(routers, routing_results)
        ])

        total = (
            lm_loss
            + self.config.load_balance_loss_weight * balance_loss
            + self.config.router_z_loss_weight * z_loss
        )
        return {
            "total_loss": total,
            "lm_loss": lm_loss,
            "balance_loss": balance_loss,
            "z_loss": z_loss,
        }

    def monitor_expert_utilization(self, routing_results):
        """
        Monitor expert utilization during training.

        Healthy MoE training: all experts receive roughly equal
        traffic (about 12.5% each for 8 experts). Unhealthy: one
        expert receives 80% of tokens (collapse).
        """
        expert_loads = np.zeros(self.config.n_experts)
        for result in routing_results:
            expert_loads += np.bincount(
                result["indices"].ravel(),
                minlength=self.config.n_experts,
            )
        total = expert_loads.sum()
        if total > 0:
            expert_loads /= total

        # Coefficient of variation (CV) of the load distribution
        cv = np.std(expert_loads) / (np.mean(expert_loads) + 1e-10)

        return {
            "expert_loads": expert_loads.tolist(),
            "cv": float(cv),
            "min_load": float(np.min(expert_loads)),
            "max_load": float(np.max(expert_loads)),
            "dead_experts": int(np.sum(expert_loads < 0.01)),
            "healthy": bool(cv < 0.3 and np.min(expert_loads) > 0.02),
        }
```
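The phase split and the CV health check are easy to exercise standalone. The fractions and thresholds below mirror the trainer above; the two load vectors are made up for illustration:

```python
import numpy as np

def phase_budgets(total_tokens):
    """Token budget per phase: 5% warmup, 70% specialization, 25% refinement."""
    return {
        "warmup": int(total_tokens * 0.05),
        "specialization": int(total_tokens * 0.70),
        "refinement": int(total_tokens * 0.25),
    }

def utilization_health(raw_counts):
    """Coefficient of variation + dead-expert count for a load distribution."""
    loads = np.asarray(raw_counts, dtype=float)
    loads = loads / loads.sum()
    cv = np.std(loads) / (np.mean(loads) + 1e-10)
    return {
        "cv": float(cv),
        "dead_experts": int(np.sum(loads < 0.01)),
        "healthy": bool(cv < 0.3 and loads.min() > 0.02),
    }

# For the default 50B-token budget: 2.5B warmup, 35B specialization,
# 12.5B refinement
budgets = phase_budgets(50_000_000_000)
print(budgets)

# Near-uniform traffic vs a router collapsing onto expert 0
balanced = utilization_health([130, 120, 125, 128, 122, 127, 124, 124])
collapsed = utilization_health([800, 150, 20, 10, 5, 5, 5, 5])
print(balanced["healthy"], collapsed["healthy"])  # -> True False
```

Note the collapsed distribution trips both alarms at once: high CV and several experts below the 1% dead-expert threshold.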
### Upcycled MoE vs Dense: Perplexity During Continued Training

| Model | Perplexity after 100B tokens of continued training |
|---|---|
| Upcycled 7B -> 8x7B MoE | 6.1 |
| Original dense 7B (no change) | 8.5 |
| Dense 14B (for reference) | 7.2 |
| Native MoE 8x7B (trained from scratch, 1T tokens) | 5.8 |
Sparse upcycling does not match natively-trained MoE quality. After 100B tokens of continued training, the upcycled model reaches perplexity 6.1: better than the original dense 7B (8.5) and better than a dense 14B (7.2), but worse than a native 8x7B MoE trained from scratch on 1T tokens (5.8). The benefit is compute cost: upcycling to perplexity 6.1 takes approximately 100B tokens of training, while the native MoE's 5.8 takes 1T tokens. Upcycling achieves about 85% of native quality at 10% of native cost.
## Expert Specialization Analysis

### What Do Experts Learn?
```python
class ExpertSpecializationAnalyzer:
    """
    Analyze what experts specialize in after upcycling.

    After sufficient continued training, experts diverge from
    their identical initialization. Typical specializations
    observed:
    - Syntactic vs semantic processing
    - Code vs natural language
    - Reasoning vs retrieval
    - Specific languages (for multilingual models)
    """

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def analyze_token_routing(self, texts):
        """
        Analyze which tokens get routed to which experts.

        Examining routing patterns across different text types
        gives insight into expert specialization.
        """
        routing_stats = {}
        for text in texts:
            tokens = self.tokenizer.encode(text)
            routing = self._get_routing(tokens)
            for token_id, expert_id in zip(tokens, routing):
                token_str = self.tokenizer.decode([token_id])
                token_type = self._classify_token(token_str)
                if token_type not in routing_stats:
                    routing_stats[token_type] = np.zeros(
                        self.model.n_experts
                    )
                routing_stats[token_type][expert_id] += 1

        # Normalize counts to per-type routing distributions
        for token_type, counts in routing_stats.items():
            total = counts.sum()
            if total > 0:
                routing_stats[token_type] = counts / total
        return routing_stats

    def _classify_token(self, token_str):
        """Classify a token into coarse categories."""
        stripped = token_str.strip()
        if stripped.isdigit():
            return "number"
        if len(stripped) == 1 and stripped in "+-*/=":
            return "math_operator"
        if len(stripped) == 1 and stripped in "(){}<>[]":
            return "bracket"
        if any(c.isupper() for c in token_str):
            return "capitalized"
        return "word"

    def _get_routing(self, tokens):
        """
        Get top-1 routing decisions for tokens.

        Placeholder: a real implementation would run the model
        and read each router's argmax; random routing stands in
        here so the analysis pipeline can be exercised end to end.
        """
        return np.random.randint(
            0, self.model.n_experts, size=len(tokens)
        )
```
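The classification-and-aggregation step can be tried standalone with a toy whitespace tokenizer and made-up routing decisions; everything below is illustrative scaffolding, not model output:

```python
import numpy as np

def classify_token(token):
    """Coarse token categories, mirroring the analyzer above."""
    stripped = token.strip()
    if stripped.isdigit():
        return "number"
    if len(stripped) == 1 and stripped in "+-*/=":
        return "math_operator"
    if any(c.isupper() for c in token):
        return "capitalized"
    return "word"

n_experts = 8
rng = np.random.default_rng(0)
tokens = "The loss = 3 + 4 tokens per Expert".split()
routing = rng.integers(0, n_experts, size=len(tokens))  # stand-in router

# Accumulate per-category expert counts, then normalize to distributions
stats = {}
for token, expert in zip(tokens, routing):
    kind = classify_token(token)
    stats.setdefault(kind, np.zeros(n_experts))[expert] += 1

for kind, counts in stats.items():
    stats[kind] = counts / counts.sum()
    print(kind, np.round(stats[kind], 2))
```

With a trained router in place of the random stand-in, skew in these per-category distributions (e.g. numbers concentrating on one or two experts) is the signal that specialization has emerged.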
## Key Takeaways

Sparse upcycling converts pretrained dense models to the MoE architecture, capturing 85% of native MoE quality at 10% of native training cost. The technique is practical for teams with limited compute that want to scale model capacity beyond what dense training allows.

The critical findings:

- Duplicate + noise initialization is the sweet spot: copying dense MLP weights to all experts with 1% Gaussian noise provides the best balance between knowledge preservation and symmetry breaking. The model recovers dense-model quality within 10B tokens and surpasses it by 25B tokens.
- Router training is the initial bottleneck: the first 5B tokens of continued training primarily train the router (which was randomly initialized). Expert weights change slowly during this phase because they start from a strong pretrained initialization.
- Load balancing is critical: without auxiliary losses (balance loss + z-loss), the router collapses to routing 80%+ of tokens to 1-2 experts. The remaining experts receive no gradient updates and become dead weight. A load balance loss coefficient of 0.01 maintains healthy utilization.
- Upcycled MoE does not match native MoE: after 100B continued-training tokens, an upcycled 8x7B MoE reaches perplexity 6.1 compared to 5.8 for a natively-trained MoE. The roughly 5% perplexity gap persists because native training allows experts to specialize from the start, while upcycled experts must unlearn their identical initialization.
- Inference efficiency gain is real: the upcycled 8x7B MoE has 45B total parameters but only 14B active per token. It outperforms a dense 14B model (same active parameters) while matching its inference cost. The efficiency gain is roughly 3.2x parameters per FLOP.
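The active-vs-total parameter arithmetic behind that last point is easy to reproduce. The dimensions below are assumptions (Mixtral-like: hidden 4096, intermediate 14336, 32 layers, plus a rough 2B allowance for attention and embeddings), chosen only to land in the same ballpark as the post's figures, not an exact model config:

```python
# Parameter accounting for a top-2-of-8 upcycled MoE.
# All dimensions are illustrative assumptions, not exact configs.
hidden = 4096
intermediate = 14336
n_layers = 32
n_experts = 8
top_k = 2
non_expert_params = 2.0e9  # rough allowance for attention + embeddings

# Each expert is a SwiGLU MLP: gate_proj + up_proj + down_proj
per_expert = 3 * hidden * intermediate * n_layers

# Total holds all experts; active per token holds only the top-k
total = non_expert_params + n_experts * per_expert
active = non_expert_params + top_k * per_expert

print(f"per-expert params: {per_expert / 1e9:.1f}B")
print(f"total: {total / 1e9:.1f}B, active per token: {active / 1e9:.1f}B")
print(f"params-per-FLOP advantage: {total / active:.1f}x")
```

Under these assumptions the ratio comes out around 3.5x; the exact value depends on how much of the model sits outside the expert MLPs, which is why the post's 3.2x differs slightly.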