In July 2025, Moonshot AI released Kimi K2 — the first open-weight model to cross the one-trillion parameter mark. With 1T total parameters, 32B activated per token, 384 routed experts, Multi-head Latent Attention, and a novel MuonClip optimizer that achieved zero training instabilities across a 15.5 trillion token run, K2 did not just join the frontier — it redefined the cost structure of getting there. On SWE-bench Verified, it scored 65.8%. On LiveCodeBench v6, 53.7%. On MMLU-Pro, it sits in the same band as Claude 3.5 Sonnet and GPT-4o.
This post is Part 1 of the Frontier Model Architectures series. We will dissect K2 at a systems level: the architecture that makes 1T parameters viable, the attention mechanism that keeps KV cache manageable at 128K context, the fine-grained expert routing that gives each token access to trillions of expert combinations, the optimizer that prevented the loss spikes that have plagued every other frontier training run, and the agentic capabilities that emerge when all these pieces come together.
1. Why Kimi K2 Matters
The Open-Weight Frontier Expands
Before K2, the largest open-weight MoE models were DeepSeek V3 at 671B total / 37B activated, and Mixtral at 47B total / 13B activated. Closed frontier models — Claude, GPT-4o, Gemini — operate at undisclosed scales but are widely estimated to use MoE architectures with total parameter counts in the trillions. K2 is the first open model to enter this territory, and it does so with a fully disclosed architecture.
The significance is threefold. First, K2 demonstrates that MoE scaling to 1T is tractable: it can be trained to completion without the catastrophic loss spikes or mid-run rollbacks that have historically plagued large-scale training. Second, it validates that the DeepSeek architectural recipe — MLA plus fine-grained MoE — generalizes beyond DeepSeek itself. Moonshot adopted MLA wholesale and extended the expert count from 256 to 384, showing that these ideas are portable and composable. Third, K2 is open-weight under an Apache 2.0 license, meaning the research community can study, fine-tune, and serve a model at a scale previously locked behind API paywalls.
What Makes K2 Different
K2 is not simply “DeepSeek V3 but bigger.” Three architectural and training choices distinguish it:
- 384 experts instead of 256: K2 pushes the fine-grained expert paradigm further, with C(384, 8) ≈ 1.1 x 10^16 possible expert combinations per token — roughly 27x more than DeepSeek V3’s C(256, 8) ≈ 4.1 x 10^14. This wider combinatorial space enables finer specialization.
- MuonClip optimizer: Rather than Adam or AdamW, K2 was trained with MuonClip — an extension of the Muon optimizer that adds gradient clipping for stability at unprecedented scale. The result: zero loss spikes across the full 15.5T token training run. No rollbacks. No restarts. This is, to our knowledge, the first reported frontier-scale training run with zero instability events.
- 32B activated / 1T total ratio: K2 activates only 3.2% of its parameters per token, compared to DeepSeek V3’s 5.5% (37B/671B) and Mixtral’s 27.7% (13B/47B). This extreme sparsity means K2 gets the quality benefits of 1T parameters at the inference cost of a 32B model.
When we say “32B activated,” we mean each token passes through 32 billion parameters during a forward pass — the attention layers, the shared expert, and 8 out of 384 routed experts. The remaining 968 billion parameters exist but are not used for any given token. They contribute to the model’s total capacity (the diversity of knowledge it can represent) without contributing to per-token cost (the FLOPs required to process one token).
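These ratios are easy to verify. A few lines of Python, using the figures quoted above:

```python
def activation_ratio(total_b: float, activated_b: float) -> float:
    """Fraction of total parameters touched per token."""
    return activated_b / total_b

# (total, activated) in billions of parameters, from the text above
models = {
    "Kimi K2":     (1000, 32),
    "DeepSeek V3": (671, 37),
    "Mixtral":     (47, 13),
}
for name, (total, active) in models.items():
    print(f"{name}: {activation_ratio(total, active):.1%} activated")
# Kimi K2: 3.2% activated
# DeepSeek V3: 5.5% activated
# Mixtral: 27.7% activated
```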
2. Architecture Overview
The Numbers
K2 is a decoder-only transformer with the following configuration:
Kimi K2 Architecture Summary
| Parameter | Value | Notes |
|---|---|---|
| Total parameters | 1 trillion | Including all 384 experts |
| Activated parameters | 32B | Per token: attention + shared + 8 routed experts |
| Layers | 61 | Including 1 dense layer |
| Experts (routed) | 384 | Per MoE layer, top-8 routing |
| Shared experts | 1 | Processes every token |
| Attention mechanism | MLA | Multi-head Latent Attention |
| Activation function | SwiGLU | Gated linear unit with SiLU |
| Vocabulary size | 160,000 | Large vocabulary for multilingual coverage |
| Context length | 128K tokens | Native, not extended post-training |
| Training tokens | 15.5 trillion | Multi-stage pre-training |
The 61-layer architecture includes 1 dense layer (all experts active) and 60 MoE layers. The dense layer, typically placed early in the network, provides a shared representational foundation before the routing mechanism begins selectively activating experts.
How K2 Compares to Other MoE Models
The evolution from Mixtral to DeepSeek V3 to K2 reveals a clear scaling trajectory: more total parameters, more experts, smaller fraction activated, and increasingly fine-grained expert specialization.
MoE Model Comparison
| Model | Total Params | Activated | Experts | Top-K | Combinations | Activation Ratio |
|---|---|---|---|---|---|---|
| Mixtral 8x7B | 47B | 13B | 8 | 2 | 28 | 27.7% |
| Grok-1 | 314B | ~86B | 8 | 2 | 28 | ~27.4% |
| DeepSeek V3 | 671B | 37B | 256 (+1 shared) | 8 | 4.1 x 10^14 | 5.5% |
| Kimi K2 | 1,000B | 32B | 384 (+1 shared) | 8 | 1.1 x 10^16 | 3.2% |
The trend is unmistakable. Early MoE models used a small number of large experts (Mixtral: 8 experts at ~7B each). Modern frontier MoE models use a large number of small experts (K2: 384 experts at ~2.5B each). The total parameter count has increased 21x from Mixtral to K2, but the activated parameters have only increased 2.5x. The gap between total and activated parameters is widening, and the combinatorial expressiveness of the routing space is growing exponentially.
Where the Parameters Live
Understanding K2’s parameter distribution clarifies why 1T parameters do not mean 1T FLOPs per token.
Kimi K2 Parameter Distribution (Approximate)
The 940B of inactive expert parameters are the defining characteristic of the MoE architecture. They consume memory (all 1T parameters must be accessible, distributed across many GPUs), but they do not consume FLOPs for any given token. This is the fundamental trade-off: memory for quality. The total parameter count determines the model’s knowledge capacity — the range of specialized behaviors it can express — while the activated parameter count determines the per-token compute cost.
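The same arithmetic in code — the ~960B expert total is this article's approximation, so treat these as illustrative figures:

```python
# All counts in billions of parameters. The 960B expert total is an
# approximation from the text; the split is illustrative.
TOTAL = 1000
EXPERT_TOTAL = 960                              # 384 routed experts combined
NON_EXPERT = TOTAL - EXPERT_TOTAL               # attention, embeddings, shared expert, ...

per_expert = EXPERT_TOTAL / 384                 # per routed expert
active_expert_params = 8 * per_expert           # top-8 routing
inactive = EXPERT_TOTAL - active_expert_params  # idle for any given token

print(f"per expert: {per_expert}B, active: {active_expert_params}B, inactive: {inactive}B")
# per expert: 2.5B, active: 20.0B, inactive: 940.0B
```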
SwiGLU: The Activation Function Inside Each Expert
Each of K2’s 385 experts (384 routed + 1 shared) uses SwiGLU as its activation function. SwiGLU has become the dominant activation in modern transformers, replacing ReLU and GELU in virtually every frontier model since Llama 2. The key difference is the gating mechanism: instead of a single up-projection followed by a pointwise nonlinearity, SwiGLU uses two parallel projections — one as the activation input and one as a gate:
$$\text{SwiGLU}(x) = \left(\text{SiLU}(x W_{\text{gate}}) \odot (x W_{\text{up}})\right) W_{\text{down}}$$

where $\text{SiLU}(x) = x \cdot \sigma(x)$ is the Sigmoid Linear Unit, $\odot$ denotes element-wise multiplication, and $W_{\text{gate}}$, $W_{\text{up}}$ are separate weight matrices. The gate pathway learns which features to activate, while the up pathway provides what values those features should take. This multiplicative interaction gives SwiGLU substantially more expressive power than an additive nonlinearity like ReLU.
The parameter cost of SwiGLU is 50% higher than standard ReLU-based FFNs (three weight matrices instead of two: gate, up, and down), but empirical results consistently show that SwiGLU achieves better loss-per-FLOP than alternatives. At K2’s scale, where every fraction of a percent in quality matters, the extra parameter cost is easily justified. For a deeper analysis of gated activations and the FFN-as-memory hypothesis, see Part 9: The FFN and SwiGLU in the Transformer Anatomy series.
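A minimal NumPy sketch of the gated FFN described above (toy dimensions; the weight names are ours, not K2's actual parameter names):

```python
import numpy as np

def silu(x):
    """SiLU: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU block: (SiLU(x W_gate) * (x W_up)) W_down."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64   # toy sizes; K2's real dimensions are far larger
x = rng.standard_normal((4, d_model))         # 4 tokens
W_gate = rng.standard_normal((d_model, d_ff))
W_up   = rng.standard_normal((d_model, d_ff))
W_down = rng.standard_normal((d_ff, d_model))

out = swiglu_ffn(x, W_gate, W_up, W_down)
print(out.shape)  # (4, 16)
```

Note the three weight matrices per expert — the 50% parameter overhead relative to a two-matrix FFN discussed above.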
3. Multi-head Latent Attention in Kimi K2
The KV Cache Problem at 128K Context
K2 supports 128K token context natively. At this scale, the KV cache becomes the dominant memory consumer during inference, easily exceeding the model weight footprint at even modest batch sizes.
Consider what standard Multi-Head Attention (MHA) would cost for a K2-like architecture. Assume 64 attention heads with head dimension 128, across 61 layers, in BF16 (2 bytes per value):

$$\underbrace{2}_{\text{K and V}} \times 64 \times 128 \times 2 \text{ bytes} = 32{,}768 \text{ bytes per token per layer}$$

At 128K context, a single sequence under MHA would require:

$$32{,}768 \times 61 \text{ layers} \times 131{,}072 \text{ tokens} \approx 244 \text{ GB}$$
That is more than the entire HBM capacity of three H100 GPUs — for a single sequence, before the model weights are even loaded. At batch size 8, you would need nearly 2 TB of KV cache alone. This is clearly unworkable.
How MLA Compresses the KV Cache
K2 uses the same Multi-head Latent Attention mechanism introduced in DeepSeek V2 and used in DeepSeek V3. Instead of caching the full K and V projections for all heads, MLA compresses them into a single low-rank latent vector.
For each token at position $t$, the hidden state $h_t$ is projected down into a compact latent representation:

$$c^{KV}_t = h_t W^{DKV}, \qquad c^{KV}_t \in \mathbb{R}^{d_c}$$

Here $d_c$ is the latent dimension — typically 512, far smaller than the $2 \times 64 \times 128 = 16{,}384$ dimensions of the full K and V for 64 heads. During attention computation, the full K and V tensors are reconstructed on-the-fly from the latent:

$$k_t = c^{KV}_t W^{UK}, \qquad v_t = c^{KV}_t W^{UV}$$

The cache stores only $c^{KV}_t$ plus a small set of decoupled RoPE keys (192 dimensions) needed for position-aware attention. The total cache per token per layer becomes roughly:

$$(512 + 192) \times 2 \text{ bytes} = 1{,}408 \text{ bytes}$$
Compare that to the MHA baseline of 32,768 bytes — a 23.3x reduction.
With MLA, K2’s KV cache for a single 128K sequence across 61 layers is approximately:

$$1{,}408 \times 61 \text{ layers} \times 131{,}072 \text{ tokens} \approx 10.5 \text{ GB}$$

Compare to the MHA baseline of roughly 244 GB. This 23x compression is what makes 128K context viable on current hardware. At batch=8, K2 needs ~84 GB of KV cache — tight but feasible on a multi-GPU setup — versus the ~1.9 TB that MHA would demand.
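The arithmetic above as a small calculator (head counts and dimensions as assumed in this section):

```python
LAYERS, CTX = 61, 128 * 1024          # 131,072 tokens

# Bytes per token per layer, BF16 (2 bytes/value)
mha = 2 * 64 * 128 * 2                # K and V, 64 heads, head_dim 128
mla = (512 + 192) * 2                 # 512-dim latent + 192 decoupled RoPE dims

mha_gb = mha * LAYERS * CTX / 2**30
mla_gb = mla * LAYERS * CTX / 2**30
print(f"MHA: {mha_gb:.0f} GB  MLA: {mla_gb:.1f} GB  compression: {mha / mla:.1f}x")
# MHA: 244 GB  MLA: 10.5 GB  compression: 23.3x
```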
The Absorption Trick
The mathematical elegance of MLA lies in the absorption of the up-projection matrices into the query and output projections. During inference, instead of reconstructing K from the latent and then computing the attention score:

$$\text{score}(t, s) = (h_t W^Q)\,(c^{KV}_s W^{UK})^\top$$

we precompute a combined query projection $W^{Q'} = W^Q (W^{UK})^\top$ and compute:

$$\text{score}(t, s) = (h_t W^{Q'})\,(c^{KV}_s)^\top$$
The up-projection is never applied at inference time — it is absorbed into the query weights during model preparation. The same trick applies to the value side: the up-projection is absorbed into the output projection. The result is that attention operates directly on the latent vectors without ever materializing the full K or V tensors.
This absorption is only possible because the up-projections are static (independent of position). RoPE, which applies position-dependent rotations, cannot be absorbed this way, which is why MLA stores a small set of decoupled RoPE keys alongside the latent. This is the minimal overhead required for position-awareness — 192 dimensions out of the original 8,192.
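The absorption identity is easy to check numerically. A toy NumPy verification (dimensions and weight names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_latent, d_head = 32, 8, 16          # toy sizes

x    = rng.standard_normal(d_model)            # query token's hidden state
c    = rng.standard_normal(d_latent)           # cached latent of a key token
W_q  = rng.standard_normal((d_model, d_head))  # query projection
W_uk = rng.standard_normal((d_latent, d_head)) # key up-projection

# Naive: reconstruct the full key from the latent, then dot with the query
score_naive = (x @ W_q) @ (c @ W_uk)

# Absorbed: fold W_uk into the query projection once, offline
W_q_absorbed = W_q @ W_uk.T                    # [d_model, d_latent]
score_absorbed = (x @ W_q_absorbed) @ c

print(np.isclose(score_naive, score_absorbed))  # True
```

The two scores agree to floating-point precision, which is exactly why the fold-in can be done once at model-load time.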
For readers who want the full mathematical derivation, see Part 6 (Attention Variants) and Part 14 (DeepSeek V3) in the Transformer Anatomy series, which cover MLA from the perspective of KV cache taxonomy and practical systems impact, respectively.
4. 384 Experts with Top-8 Routing
The Combinatorial Argument for More Experts
The number of unique expert combinations available to each token is:

$$\binom{384}{8} = \frac{384!}{8!\,376!} \approx 1.1 \times 10^{16}$$

That is on the order of ten quadrillion unique combinations. For context:
Expert Combination Space by Model
| Model | Experts | Top-K | Combinations | Log10 |
|---|---|---|---|---|
| Mixtral 8x7B | 8 | 2 | 28 | 1.4 |
| Grok-1 | 8 | 2 | 28 | 1.4 |
| Switch Transformer | 128 | 1 | 128 | 2.1 |
| DeepSeek V3 | 256 | 8 | 4.1 x 10^14 | 14.6 |
| Kimi K2 | 384 | 8 | 1.1 x 10^16 | 16.0 |
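These counts are plain binomial coefficients, so Python's math.comb reproduces the table directly:

```python
from math import comb, log10

for name, n_experts, top_k in [
    ("Mixtral 8x7B", 8, 2),
    ("Switch Transformer", 128, 1),
    ("DeepSeek V3", 256, 8),
    ("Kimi K2", 384, 8),
]:
    n = comb(n_experts, top_k)
    print(f"{name}: C({n_experts},{top_k}) = {n:.3g} (log10 = {log10(n):.1f})")
```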
Why does this matter? Each unique combination of 8 experts represents a different “virtual sub-model” — a different computational pathway through the network. With 28 combinations (Mixtral), the model essentially has 28 distinct modes of processing. With roughly 10^16 combinations, K2 can assign a nearly unique processing pipeline to every token it will ever see. This enables extremely fine-grained specialization: one expert might specialize in Python f-string syntax, another in Japanese honorific particles, another in differential geometry notation.
The combinatorial richness also provides graceful degradation. If a token’s optimal expert set is unavailable (e.g., due to load-balancing constraints), there are many near-optimal alternatives. In a Mixtral-style model with only 28 combinations, any routing constraint significantly limits options.
Expert Size and Granularity
With 384 routed experts and approximately 960B parameters in the expert layers, each expert is roughly:

$$960\text{B} / 384 \approx 2.5\text{B parameters}$$
Compare this to Mixtral’s experts at ~7B each. K2’s experts are roughly 3x smaller, which means each one specializes more narrowly. The trade-off: smaller experts have less individual capacity, but the top-8 routing means 8 of them collaborate on each token, collectively providing 20B parameters of expert compute. The argument is that 8 specialized 2.5B experts working together outperform 2 generalist 7B experts, because the specialization is more targeted.
The Shared Expert
Like DeepSeek V3, K2 includes one shared expert that processes every token regardless of routing decisions. The shared expert serves a critical architectural role: it captures universal knowledge — common syntactic patterns, function words, basic arithmetic, formatting conventions — that every token needs. This prevents the routed experts from wasting capacity on universally-needed knowledge and allows them to focus on specialized domains.
The forward pass through a K2 MoE layer:
```python
import torch
import torch.nn.functional as F

def k2_moe_forward(hidden, shared_expert, routed_experts, router, k=8):
    """
    Kimi K2 MoE layer forward pass (illustrative).
    384 routed experts, 1 shared expert, top-8 routing.
    hidden: [num_tokens, d_model]
    """
    # Shared expert: always active, captures universal knowledge
    shared_out = shared_expert(hidden)

    # Router: compute gating scores over all 384 experts
    logits = router(hidden)                       # [num_tokens, 384]
    topk_vals, topk_ids = logits.topk(k, dim=-1)  # [num_tokens, 8]
    gates = F.softmax(topk_vals, dim=-1)          # normalize gate weights over the top-8

    # Dispatch: each expert processes only the tokens routed to it
    routed_out = torch.zeros_like(hidden)
    for expert_id, expert in enumerate(routed_experts):
        token_idx, slot = (topk_ids == expert_id).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue  # no tokens routed to this expert this step
        weight = gates[token_idx, slot].unsqueeze(-1)
        routed_out[token_idx] += weight * expert(hidden[token_idx])

    return shared_out + routed_out
```
Expert Parallelism at 384 Experts
Distributing 384 experts across a GPU cluster requires careful planning. With expert parallelism degree EP=64 (a common configuration for this scale), each GPU holds 384 / 64 = 6 experts. The dispatch-combine communication pattern sends each token’s hidden state to up to 8 different GPUs (one for each selected expert), processes them in parallel, and returns the results.
The communication volume per MoE layer, for a micro-batch of $B$ tokens with hidden dimension $d$ in BF16:

$$V = B \times 8 \times d \times 2 \text{ bytes}$$

For $B = 4096$ and $d = 7168$ (estimated for K2’s hidden dimension):

$$V = 4096 \times 8 \times 7168 \times 2 \approx 0.47 \text{ GB per layer}$$

Across 60 MoE layers, that is approximately 27-28 GB of all-to-all communication per forward pass. This is manageable with high-bandwidth interconnects (NVLink at 160+ GB/s intra-node, InfiniBand at 50+ GB/s inter-node), but it underscores why MoE at this scale requires premium networking infrastructure.
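Plugging the assumed values (B = 4096, d = 7168) into the formula:

```python
def all_to_all_bytes(batch_tokens: int, top_k: int, d_model: int,
                     bytes_per_value: int = 2) -> int:
    """Per-MoE-layer dispatch volume: each token's hidden state goes to top_k experts."""
    return batch_tokens * top_k * d_model * bytes_per_value

per_layer = all_to_all_bytes(4096, 8, 7168)   # BF16
total = per_layer * 60                        # 60 MoE layers
print(f"per layer: {per_layer / 1e6:.0f} MB, per forward pass: {total / 1e9:.1f} GB")
# per layer: 470 MB, per forward pass: 28.2 GB
```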
Load Balancing with 384 Experts
The load-balancing challenge intensifies with more experts. With 384 experts and top-8 routing, the expected token load per expert is 8/384 ≈ 2.1% of all tokens. Even small imbalances become problematic: if one expert receives 4% of tokens instead of 2%, it becomes a throughput bottleneck while other experts sit underutilized.
K2 employs auxiliary-loss-free load balancing, following DeepSeek V3’s approach. Per-expert bias terms $b_i$ are added to the router scores $s_{i,t}$ when selecting experts:

$$\text{TopK}\big(\{\, s_{i,t} + b_i \,\}_{i=1}^{384},\; 8\big)$$

The biases influence only which experts are selected; the gate weights applied to expert outputs still use the unbiased scores.
These biases are adjusted by a control loop outside the gradient computation: overloaded experts get their bias decreased (reducing routing probability), underloaded experts get their bias increased. This achieves near-perfect load balance without distorting the training gradient — a critical property when training at 15.5T tokens, where even tiny gradient distortions compound into measurable quality degradation.
At K2’s scale, the auxiliary loss approach used by Switch Transformer and GShard would be particularly damaging. The auxiliary loss must balance 384 experts instead of 8 or 128, requiring a stronger balancing signal. But a stronger auxiliary loss means more gradient interference with the language modeling objective. The loss-free approach sidesteps this trade-off entirely — you get exact load balance with zero quality impact, regardless of expert count.
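A sketch of that control loop — the sign-based rule follows DeepSeek V3's published method; the update rate gamma here is an illustrative value:

```python
import numpy as np

def update_biases(biases, tokens_per_expert, gamma=0.001):
    """
    Auxiliary-loss-free balancing (sketch): nudge per-expert router
    biases toward uniform load, outside the gradient computation.
    """
    mean_load = tokens_per_expert.mean()
    # Overloaded experts: bias down; underloaded experts: bias up
    return biases - gamma * np.sign(tokens_per_expert - mean_load)

biases = np.zeros(384)
load = np.full(384, 100.0)
load[0] = 400.0                      # expert 0 is overloaded this step
biases = update_biases(biases, load)
print(biases[0] < 0, biases[1] > 0)  # True True
```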
5. The MuonClip Optimizer
Why Adam Struggles at Frontier Scale
The Adam optimizer has been the default for transformer training since the original paper. It maintains per-parameter first and second moment estimates, adapting the learning rate for each weight independently. For models up to roughly 100B parameters, Adam works well with careful hyperparameter tuning.
At frontier scale — hundreds of billions to trillions of parameters — Adam shows two weaknesses:
Convergence speed: Adam’s adaptive learning rates are derived from diagonal approximations to the loss curvature. For large models with complex loss landscapes, this diagonal approximation becomes increasingly inaccurate. The optimizer takes many steps in suboptimal directions, wasting precious training FLOPs.
Stability: Large models exhibit more frequent loss spikes — sudden, dramatic increases in training loss that require rolling back to a checkpoint and potentially adjusting hyperparameters. These instabilities are believed to arise from sharp loss landscape features that Adam’s moment estimates fail to anticipate. For DeepSeek V3’s 14.8T token run, the team reported having to carefully manage training stability. For Llama 3 405B, Meta reported multiple mid-training interventions.
The Muon Optimizer
Muon (MomentUm Orthogonalized by Newton-Schulz) replaces Adam’s diagonal preconditioning with a more principled update rule based on matrix preconditioning. For a weight matrix $W$ with gradient $G_t$:

Adam update:

$$W \leftarrow W - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where $\hat{m}_t$ and $\hat{v}_t$ are running first and second moment estimates, computed element-wise. This is a diagonal approximation — it treats each weight independently, ignoring correlations between weights in the same row or column.

Muon update: Muon uses the Nesterov momentum-based gradient $M_t$ and then applies Newton-style matrix preconditioning. The key step is orthogonalizing the update direction using the singular value decomposition (SVD) or an approximation thereof:

$$W \leftarrow W - \eta \, U V^\top, \qquad \text{where } M_t = U \Sigma V^\top$$
The Newton-Schulz iteration is an efficient method for computing the matrix sign function, which effectively normalizes the gradient update across all singular directions simultaneously. In pseudocode:
```python
def muon_step(W, G, momentum_buffer, lr, beta=0.95, ns_steps=5):
    """
    Simplified Muon optimizer step.
    W: 2D weight matrix (updated in place)
    G: gradient of W
    momentum_buffer: running momentum state, same shape as G
    Returns the updated (W, momentum_buffer).
    """
    # 1. Nesterov momentum
    momentum_buffer = beta * momentum_buffer + G
    G_look_ahead = G + beta * momentum_buffer

    # 2. Newton-Schulz orthogonalization (ns_steps iterations).
    #    Approximates the orthogonal factor U V^T of G_look_ahead's SVD.
    #    Normalizing first keeps the singular values inside the
    #    iteration's convergence region.
    X = G_look_ahead / (G_look_ahead.norm() + 1e-7)
    for _ in range(ns_steps):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X  # cubic Newton-Schulz iteration

    # 3. Update
    W -= lr * X
    return W, momentum_buffer
```
The Newton-Schulz orthogonalization ensures that the gradient update has uniform singular values — it pushes equally in all directions of the weight matrix’s column space. This prevents the optimizer from repeatedly updating along a few dominant gradient directions while ignoring others, which is a common failure mode of Adam at large scale.
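The "uniform singular values" claim can be checked directly: starting from a random matrix, the cubic iteration drives every singular value toward 1 (we run more iterations here than the 5 used in practice, to converge tightly):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Drive the singular values of G toward 1 via the cubic iteration."""
    X = G / np.linalg.norm(G)   # scale into the iteration's convergence region
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 16))
X = newton_schulz_orthogonalize(G)
print(np.round(np.linalg.svd(X, compute_uv=False), 3))  # all entries ≈ 1.0
```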
From Muon to MuonClip
The original Muon optimizer was developed for smaller-scale experiments (up to ~1B parameters). Scaling it to K2’s trillion parameters introduced a new challenge: while Muon’s matrix preconditioning improves convergence, the orthogonalized updates can occasionally produce very large weight changes — particularly during phase transitions in training (e.g., when the model suddenly learns a new capability and the loss landscape shifts rapidly).
MuonClip adds a gradient clipping mechanism specifically designed for Muon’s update structure. Rather than simply clipping the gradient norm (as is standard with Adam), MuonClip clips the post-orthogonalization update $O_t$:

$$O_t \leftarrow O_t \cdot \min\left(1,\; \frac{\tau}{\|O_t\|_F}\right)$$

where $\tau$ is a carefully tuned threshold and $\|\cdot\|_F$ is the Frobenius norm. The critical insight is that clipping after the Newton-Schulz step is more principled than clipping before it. Pre-orthogonalization clipping would distort the balanced update directions that Muon works hard to compute. Post-orthogonalization clipping preserves the directional balance while simply limiting the step magnitude.
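The clipping step itself is a few lines — this sketch follows the description above, with an arbitrary illustrative threshold:

```python
import numpy as np

def clip_update(O, tau=1.0):
    """Scale the orthogonalized update O down if its Frobenius norm exceeds tau."""
    norm = np.linalg.norm(O)
    return O * min(1.0, tau / norm)

O = np.full((4, 4), 2.0)            # ||O||_F = sqrt(16 * 4) = 8
clipped = clip_update(O, tau=1.0)
print(np.linalg.norm(clipped))      # 1.0 — direction preserved, magnitude capped
```

Updates already below the threshold pass through unchanged, so the clip only engages on the rare oversized steps.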
The Achievement: Zero Instabilities
The headline result of MuonClip is K2’s training stability: zero loss spikes across 15.5 trillion tokens. To appreciate how remarkable this is, consider the history of frontier training runs:
Training Stability of Frontier Models
| Model | Training Tokens | Optimizer | Reported Instabilities | Required Rollbacks |
|---|---|---|---|---|
| GPT-4 (est.) | ~13T | Adam variant | Multiple loss spikes | Yes (estimated) |
| Llama 3 405B | 15T | AdamW | Multiple interventions | Yes |
| DeepSeek V3 | 14.8T | AdamW | Managed carefully | Minimal |
| Kimi K2 | 15.5T | MuonClip | Zero | Zero |
Each rollback during training wastes compute — the tokens processed between the checkpoint and the spike must be re-processed. For a training run consuming thousands of GPU-hours per day, a single rollback can cost millions of dollars. More insidiously, repeated instabilities force conservative hyperparameter choices (lower learning rates, more warmup), which slow convergence and increase total training cost.
MuonClip eliminates this problem not by being conservative, but by being geometrically principled. The Newton-Schulz preconditioning ensures that updates are balanced across all directions, and the post-orthogonalization clipping prevents any single update from being catastrophically large. The optimizer can use aggressive learning rates (faster convergence) without the stability risk that would accompany such rates under Adam.
Zero instabilities means zero wasted compute from rollbacks. For a training run of K2’s scale (estimated at millions of GPU-hours), avoiding even 2-3 rollbacks could save 5-10% of total training cost — potentially tens of millions of dollars. Beyond the direct savings, the predictability of a stable training run simplifies scheduling, budgeting, and infrastructure planning.
Optimizer Memory Overhead
One consideration with Muon-family optimizers is memory. Adam stores two state tensors per parameter (first and second moments): at 4 bytes each, that is 8 bytes of optimizer state per parameter. For 1T parameters, Adam would require 8 TB of optimizer state.
Muon’s memory overhead depends on implementation details. The momentum buffer is equivalent to Adam’s first moment (4 bytes per parameter). The Newton-Schulz iteration operates on gradient-sized matrices and does not require persistent per-parameter state beyond the momentum buffer. In practice, Muon’s memory footprint is comparable to SGD with momentum — roughly 4 bytes per parameter — a 2x reduction over Adam.
At 1T parameters, this difference is significant: 4 TB versus 8 TB of optimizer state. Distributed across hundreds of GPUs, this translates to meaningful memory savings per device that can be used for larger batch sizes or longer sequences.
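The bookkeeping in code (FP32 optimizer state assumed, as in the text):

```python
def optimizer_state_tb(params_trillions: float, bytes_per_param: int) -> float:
    """Optimizer state in TB: (params_trillions * 1e12 params * bytes) / 1e12 bytes-per-TB."""
    return params_trillions * bytes_per_param

adam_tb = optimizer_state_tb(1.0, 8)   # two FP32 moments: 4 + 4 bytes per param
muon_tb = optimizer_state_tb(1.0, 4)   # one FP32 momentum buffer
print(f"Adam: {adam_tb:.0f} TB, Muon: {muon_tb:.0f} TB")  # Adam: 8 TB, Muon: 4 TB
```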
6. Training at Scale
The 15.5 Trillion Token Dataset
K2 was pre-trained on 15.5 trillion tokens — one of the largest disclosed training datasets. For context:
Pre-training Data Scale (Trillions of Tokens)
The Chinchilla scaling law suggests that optimal training allocates compute roughly equally between model parameters and training tokens. For 1T parameters, Chinchilla would prescribe approximately 20T tokens. K2’s 15.5T tokens is slightly below this optimum, suggesting that Moonshot may have been compute-constrained or that MoE models’ effective parameter count (for scaling-law purposes) is closer to the activated 32B than the total 1T.
Multi-Stage Training
K2 follows a multi-stage pre-training recipe, sometimes called the “Moonlight” approach:
Stage 1 — General pre-training: The bulk of the 15.5T tokens, trained on a broad web corpus with standard next-token prediction. This stage establishes the model’s general knowledge and language capabilities.
Stage 2 — Long-context extension: The context window is extended to 128K tokens using a combination of position interpolation and targeted long-context data. This stage is critical: naive training on short sequences would not produce a model capable of utilizing 128K context effectively. The data mix shifts to include longer documents, multi-document reasoning tasks, and code repositories.
Stage 3 — Quality refinement: A smaller set of high-quality data is used for a final pre-training phase. This typically includes curated textbooks, peer-reviewed papers, verified code, and high-quality multilingual data. The learning rate is reduced, and the focus shifts from breadth to depth of knowledge.
Infrastructure Requirements
Training a 1T MoE model requires a massive distributed system. While Moonshot has not disclosed their full infrastructure, we can estimate requirements from the architecture:
Memory: 1T parameters in BF16 require 2 TB of memory for weights alone. Adding optimizer state (4 TB with MuonClip), activations, and KV cache for training sequences, the total memory footprint is likely 8-12 TB. With H100s at 80 GB each, this requires a minimum of 100-150 GPUs just to hold the model.
Parallelism: K2 almost certainly uses a 4-5D parallelism strategy similar to DeepSeek V3:
- Expert Parallelism (EP): 64-128 GPUs for distributing 384 experts
- Tensor Parallelism (TP): 4-8 GPUs for partitioning attention and expert FFN layers
- Pipeline Parallelism (PP): 4-8 stages for distributing the 61 layers
- Data Parallelism (DP): Replication across training groups for throughput
Networking: The all-to-all communication pattern for expert parallelism demands high-bandwidth, low-latency interconnects. Intra-node NVLink (900 GB/s on H100 NVSwitch systems) handles expert dispatch within a node, while cross-node communication relies on InfiniBand or equivalent at 400 Gb/s per port.
Total GPU-hours: Based on the model size, training tokens, and estimated efficiency, K2’s pre-training likely required on the order of 3-5 million H100-equivalent GPU-hours. At current cloud pricing of roughly $2-3 per H100-hour, that works out to $6-15 million — steep, but significantly less than training a dense model of equivalent quality.
While K2 activates only 32B parameters per token (a single GPU could handle this compute), all 1T parameters must be accessible for routing to work. In BF16, the weights alone require 2 TB of memory — 25 H100 GPUs just to hold the model. Expert offloading to CPU/SSD can reduce the GPU count, but at a significant latency penalty. This is the fundamental MoE serving trade-off: low per-token compute, high total memory footprint.
7. Agentic Capabilities
Why K2 Excels at Tool Use and Code
K2’s benchmark results on agentic tasks are among the most impressive in its profile:
Kimi K2 Agentic and Code Benchmarks
SWE-bench Verified at 65.8% is particularly notable. This benchmark requires the model to read a GitHub issue, navigate a real codebase, identify the relevant files, and produce a working patch — a multi-step agentic workflow that demands code understanding, planning, tool use, and precise editing. K2’s performance here is competitive with Claude 3.5 Sonnet and GPT-4o, and represents a significant step forward for open-weight models on agentic tasks.
The MoE Architecture Advantage for Diverse Tasks
There is a plausible architectural argument for why MoE models, and particularly fine-grained MoE models like K2, excel at diverse task handling. Agentic workflows involve many distinct sub-tasks in sequence: reading natural language (the issue description), understanding code structure (AST parsing, import resolution), reasoning about program semantics (what the bug is), generating code (the fix), and verifying correctness (does it match the test expectations).
Each sub-task may benefit from different specialized knowledge. A dense model must handle all sub-tasks with the same 32B parameters. K2’s routing mechanism can dynamically select different expert combinations for each step: code-reading experts for understanding the codebase, reasoning experts for diagnosing the bug, and code-generation experts for writing the fix. The roughly 10^16 possible combinations provide enough specialization that the model can assemble a near-optimal “virtual sub-model” for each sub-task.
This hypothesis is supported by empirical observations across MoE models: they consistently show disproportionate gains on diverse benchmarks (those requiring many different skills) relative to their gains on narrow benchmarks (those testing a single skill). K2’s strong SWE-bench performance — which requires the widest diversity of skills — aligns with this pattern.
Tool Use and Function Calling
K2 was specifically optimized during post-training for tool use and function calling. Moonshot’s RLHF and instruction tuning pipeline emphasizes:
- Structured output generation: K2 produces reliable JSON for function calls, with low rates of malformed output
- Multi-turn tool chains: The model maintains coherent plans across multiple tool-use steps, correctly interpreting tool outputs and deciding on next actions
- Error recovery: When a tool call fails (e.g., a file not found, an API error), K2 can diagnose the failure and retry with corrected parameters
The 128K context window is essential for agentic workflows. Navigating a real codebase often requires holding multiple files in context simultaneously — the bug report, the failing test, the implementation file, related utility files, type definitions. At 128K tokens, K2 can hold dozens of moderate-length source files in a single context window, covering most real-world codebase exploration tasks without chunking or retrieval augmentation.
Benchmark Context: K2 Against Frontier Models
To understand K2’s position in the landscape, consider its performance relative to both open-weight and closed-source frontier models across diverse benchmarks:
Kimi K2 vs Frontier Models (Selected Benchmarks)
| Benchmark | Kimi K2 | Claude 3.5 Sonnet | GPT-4o | DeepSeek V3 |
|---|---|---|---|---|
| SWE-bench Verified | 65.8% | ~64% | ~62% | ~55% |
| LiveCodeBench v6 | 53.7% | ~51% | ~49% | ~47% |
| MMLU-Pro | 85.7% | ~86% | ~85% | ~84% |
| GPQA Diamond | 63.2% | ~65% | ~64% | ~60% |
| MATH-500 | 90.2% | ~91% | ~89% | ~88% |
The pattern is instructive: K2’s advantage is most pronounced on agentic and code-heavy benchmarks, where the diversity of required skills plays to the MoE architecture’s strengths. On pure knowledge benchmarks like MMLU-Pro, the gap between models narrows because the task primarily tests breadth of memorized knowledge rather than multi-step reasoning across different skill domains. This suggests that as AI applications shift increasingly toward agentic workflows — tool use, code generation, multi-step planning — the fine-grained MoE architecture may provide a structural advantage that dense models cannot match without proportionally more compute.
8. What Kimi K2 Means for the Field
The Convergence: MoE + MLA + Novel Optimizers
K2 demonstrates that three independent lines of research — sparse MoE architectures, latent attention compression, and advanced optimization — converge to make trillion-parameter models practical. Each innovation addresses a different bottleneck:
How K2's Innovations Address Different Bottlenecks
| Bottleneck | Innovation | Impact |
|---|---|---|
| Per-token compute cost | MoE (384 experts, top-8) | 1T quality at 32B compute cost |
| KV cache memory | MLA (latent compression) | 23x reduction vs MHA at 128K context |
| Training stability | MuonClip optimizer | Zero instabilities across 15.5T tokens |
| Expert specialization | Fine-grained experts (384 x 2.5B) | ~10^16 unique expert combinations |
| Load balancing | Auxiliary-loss-free bias terms | Perfect balance with zero quality impact |
| Serving memory | Expert offloading (future) | Trade latency for reduced GPU count |
None of these innovations exist in isolation. MoE requires MLA (without KV cache compression, the 128K context would be unservable). MuonClip requires MoE (without the reduced per-token compute from sparsity, the training budget would be prohibitive). Fine-grained experts require auxiliary-loss-free balancing (384 experts would be unbalanceable with traditional auxiliary losses). The architecture is a system, not a collection of independent features.
The DeepSeek Recipe, Generalized
K2 is the strongest evidence to date that the DeepSeek V3 architectural recipe — MLA plus fine-grained MoE plus auxiliary-loss-free load balancing — is not a one-off achievement but a generalizable template for frontier model development. Moonshot adopted this recipe, extended it (more experts, novel optimizer), and achieved competitive results. This lowers the barrier for other labs: the architectural playbook is proven, published, and now validated by multiple independent teams.
We should expect to see more models following this template in 2025-2026:
- Expert counts continuing to increase (512? 1024?)
- MLA becoming the default attention mechanism for large-scale models
- Muon-family optimizers replacing Adam for frontier training
- Activation ratios dropping below 3% as total parameter counts grow
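The trend behind the last bullet is visible in the activation ratios of the open MoE generations cited earlier in this post:

```python
# Activation ratio (activated params / total params) across open MoE
# generations, using the (active, total) figures quoted in this post.
models = {
    "Mixtral":     (13e9,  47e9),
    "DeepSeek V3": (37e9,  671e9),
    "Kimi K2":     (32e9,  1000e9),
}

for name, (active, total) in models.items():
    print(f"{name:12s} {active / total:6.1%} activated")
```

Each generation roughly halves (or better) the fraction of parameters touched per token, which is why sub-3% ratios look like the natural next step rather than a radical departure.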
Open Questions
Several architectural questions remain unresolved:
Optimal expert granularity: K2 uses 384 experts at ~2.5B each. Would 768 experts at ~1.25B each be better? At some point, individual experts become too small to learn meaningful specialization. Finding this minimum viable expert size is an active research question.
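The granularity trade-off is easy to make concrete: halving expert size while doubling both the expert count and the routed top-k leaves total and activated FFN parameters unchanged, so granularity is a free variable at fixed compute. The figures below use this post's approximate numbers; the 768-expert variant is purely hypothetical:

```python
# Granularity sketch: splitting experts in half (and doubling top-k to
# compensate) keeps both total and activated FFN parameters constant.
# Sizes are the post's approximations, not exact K2 numbers.

def moe_ffn_params(n_experts: int, expert_size: float, top_k: int):
    total = n_experts * expert_size      # parameters stored
    activated = top_k * expert_size      # parameters touched per token
    return total, activated

coarse = moe_ffn_params(384, 2.5e9, 8)    # K2-like: 384 x ~2.5B, top-8
fine   = moe_ffn_params(768, 1.25e9, 16)  # hypothetical finer split

print("coarse (total, activated):", coarse)
print("fine   (total, activated):", fine)
```

Since the compute budget is identical either way, the open question is purely about learning dynamics: whether a ~1.25B expert is still large enough to develop a meaningful specialization.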
MuonClip generalization: K2’s zero-instability claim is impressive, but we do not yet know if MuonClip generalizes to all model scales and data distributions, or if its hyperparameters require significant tuning per setup. Broader adoption across labs will test this.
Serving efficiency: K2’s 1T parameter footprint makes serving expensive in terms of memory (2 TB for weights alone). Expert offloading, quantization (INT4/INT8 experts), and speculative expert loading are all active research directions for making trillion-parameter MoE serving practical. The model’s quality at INT4 quantization — and whether the fine-grained expert structure makes quantization harder or easier — is an important open question.
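A rough memory sketch, assuming a flat 1e12 parameters and ignoring KV cache, activations, and runtime overhead, shows why quantization is so consequential at this scale:

```python
import math

# Weight-memory footprint of a 1T-parameter model at different precisions,
# and the minimum number of 80 GB GPUs the weights alone would occupy.
# This ignores KV cache, activations, and framework overhead entirely.

TOTAL_PARAMS = 1e12
GPU_MEM_GB = 80

results = {}
for fmt, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    gb = TOTAL_PARAMS * bytes_per_param / 1e9
    gpus = math.ceil(gb / GPU_MEM_GB)
    results[fmt] = (gb, gpus)
    print(f"{fmt}: {gb:,.0f} GB weights -> at least {gpus} x 80 GB GPUs")
```

Going from FP16 to INT4 cuts the weight footprint by 4x, which at this scale is the difference between a 25-GPU and a 7-GPU minimum just to hold the model, before any quality question is even asked.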
Scaling beyond 1T: If the trend continues (47B to 671B to 1T), the next generation might target 2-5T parameters with 64-100B activated. Can MuonClip maintain zero instabilities at 2T? Can MLA scale to 256K or 1M context without quality degradation? Can expert counts reach 1,024 without the router collapsing onto a small subset of experts?
Connection to This Series
This post opens the Frontier Model Architectures series, which will examine the architectural choices of leading models in detail. Future posts will cover:
- Gemini’s approach to multimodal MoE
- Llama 4’s dense-to-MoE transition
- The evolution of context length extension techniques
- Novel attention variants beyond MLA
For background on the component technologies that K2 builds on, refer to the Transformer Anatomy series:
- Part 6: Attention Variants (MHA, MQA, GQA, MLA) — the full derivation of MLA’s KV cache compression
- Part 9: The FFN and SwiGLU — why gated activations outperform ReLU and GELU
- Part 10: Mixture of Experts — router design, load balancing, and expert parallelism fundamentals
- Part 14: DeepSeek V3 Deep Dive — the architectural foundation that K2 extends
This is Part 1 of the Frontier Model Architectures series. Unlike the Transformer Anatomy series, which builds up from first principles, this series assumes familiarity with transformer fundamentals and focuses on the specific architectural choices that define each frontier model. Each post can be read independently, but they will reference each other and the foundational series as needed.
Summary
Kimi K2 is the first open-weight model to demonstrate that the trillion-parameter MoE approach is not only tractable but competitive with the best closed models. Its key contributions are:
- 384-expert fine-grained MoE with an astronomically large space of routing combinations, enabling extreme specialization while activating only 32B parameters per token
- Multi-head Latent Attention that compresses the KV cache by 23x, making 128K context feasible without exotic hardware configurations
- MuonClip optimizer that achieved zero training instabilities across 15.5T tokens — a stability result unprecedented at frontier scale
- Strong agentic performance (65.8% SWE-bench Verified) that validates the hypothesis that fine-grained MoE architectures naturally excel at diverse, multi-step tasks
The architectural lesson from K2 is that frontier model development is now a systems integration challenge. No single innovation — not MoE, not MLA, not MuonClip — is sufficient on its own. The competitive advantage comes from combining them into a coherent system where each innovation addresses a different bottleneck: compute, memory, stability, and specialization. Moonshot demonstrated that this integration is reproducible, and in doing so, gave the open research community a reference point for what a trillion-parameter model looks like from the inside.