Part 14 of 23 in the series: Transformer Anatomy

DeepSeek V3 is a 671B-parameter Mixture-of-Experts model that activates only 37B parameters per token. It was pre-trained on 14.8 trillion tokens using just 2.788 million H800 GPU hours — a fraction of what comparably-performing models cost. It outperforms Llama 3 405B on most benchmarks while using roughly 10x less training compute.

This post analyzes the six innovations that make this efficiency possible: Multi-head Latent Attention (MLA), fine-grained MoE, auxiliary-loss-free load balancing, multi-token prediction, FP8 training, and DualPipe scheduling.

Multi-head Latent Attention (MLA)

The Problem MLA Solves

Standard GQA (used by Llama 3) stores separate K and V tensors for each KV group in the cache. For Llama 3 70B with 8 KV heads and $d_h = 128$:

$$\text{KV per token per layer} = 2 \times 8 \times 128 \times 2 = 4{,}096 \text{ bytes (FP16)}$$

Across 80 layers and a 4K context: $4{,}096 \times 80 \times 4{,}096 \approx 1.34$ GB per sequence. At batch = 64, that's roughly 86 GB — a KV cache footprint in the same ballpark as the model weights themselves.
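The arithmetic above is easy to verify (a quick sketch; the Llama 3 70B figures are those quoted in the text):

```python
# KV cache size for a GQA model (Llama 3 70B figures from the text)
bytes_fp16 = 2          # bytes per FP16 value
kv_tensors = 2          # one K and one V per cached position
n_kv_heads = 8
d_head = 128
n_layers = 80
seq_len = 4096
batch = 64

per_token_per_layer = kv_tensors * n_kv_heads * d_head * bytes_fp16  # 4,096 bytes
per_sequence = per_token_per_layer * n_layers * seq_len              # ~1.34 GB
per_batch_gb = per_sequence * batch / 1e9

print(per_token_per_layer, per_sequence / 1e9, per_batch_gb)
```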

The MLA Insight

Instead of caching full K and V heads, MLA compresses them into a single low-rank latent vector $c_t^{KV}$:

$$c_t^{KV} = W^{DKV} h_t \quad \text{where } c_t^{KV} \in \mathbb{R}^{d_c}, \; d_c \ll n_h \times d_h$$

During attention, K and V are reconstructed on-the-fly:

$$K_t = W^{UK} c_t^{KV}, \quad V_t = W^{UV} c_t^{KV}$$

The cache stores only $c_t^{KV}$ — a vector of dimension $d_c = 512$ instead of $n_h \times d_h = 8 \times 128 = 1{,}024$ for each of K and V (2,048 total). That's a 75% reduction from the latent compression alone.
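A shape-level sketch of the compression, with the dimensions quoted above (random matrices stand in for the trained projections):

```python
import numpy as np

d_model, n_h, d_h, d_c = 4096, 8, 128, 512   # dims from the text (illustrative)
rng = np.random.default_rng(0)

W_dkv = rng.standard_normal((d_c, d_model)) * 0.01    # down-projection (cached side)
W_uk  = rng.standard_normal((n_h * d_h, d_c)) * 0.01  # up-projection for K
W_uv  = rng.standard_normal((n_h * d_h, d_c)) * 0.01  # up-projection for V

h_t = rng.standard_normal(d_model)     # hidden state for one token
c_kv = W_dkv @ h_t                     # this 512-dim vector is ALL we cache
K_t, V_t = W_uk @ c_kv, W_uv @ c_kv    # reconstructed on the fly at attention time

cached = c_kv.size                     # 512 values
full = 2 * n_h * d_h                   # 2,048 values for separate K and V
print(f"cache: {cached} vs {full} values -> {1 - cached / full:.0%} smaller")
```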

The Absorption Trick

The real magic: during inference, the up-projection WUKW^{UK} can be absorbed into the query projection. Instead of:

$$\text{score} = q_t^{\top} \left( W^{UK} c_s^{KV} \right)$$

We precompute $\hat{W}^{Q} = (W^{UK})^{\top} W^{Q}$ and compute:

$$\text{score} = (\hat{W}^{Q} h_t)^{\top} c_s^{KV}$$

This means we never materialize the full K tensor at all. The same trick applies to V via the output projection. Result: a 93.3% KV cache reduction versus standard MHA, with quality on par with MHA in DeepSeek's reported ablations.
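The absorption is just matrix-multiplication associativity, which a few lines of numpy can confirm (toy dimensions, random weights):

```python
import numpy as np

d_model, d_q, d_c = 64, 32, 16          # small illustrative dimensions
rng = np.random.default_rng(1)
W_q  = rng.standard_normal((d_q, d_model))
W_uk = rng.standard_normal((d_q, d_c))  # up-projects the latent into key space
h_t  = rng.standard_normal(d_model)     # query-side hidden state
c_s  = rng.standard_normal(d_c)         # cached latent for position s

# Naive: materialize the key, then dot with the query
score_naive = (W_q @ h_t) @ (W_uk @ c_s)

# Absorbed: fold W_uk into the query projection; no key is ever materialized
W_q_hat = W_uk.T @ W_q                  # (d_c, d_model), precomputed once
score_absorbed = (W_q_hat @ h_t) @ c_s

assert np.allclose(score_naive, score_absorbed)
```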

📊 KV Cache per Token per Layer (FP16)

| Method | Cache dimensions | Bytes/token/layer | Reduction |
|---|---|---|---|
| MHA (64 heads, $d_h = 128$) | 2 × 64 × 128 | 32,768 | 1× (baseline) |
| GQA-8 (Llama 3) | 2 × 8 × 128 | 4,096 | 8× |
| MQA (1 KV head) | 2 × 1 × 128 | 512 | 64× |
| MLA (DeepSeek V3) | 512 + 192 (RoPE keys) | 1,408 | 23.3× |

Note: MLA stores a 512-dim latent vector plus 192-dim decoupled RoPE keys per token per layer.
ℹ️ Why MLA Needs Decoupled RoPE Keys

RoPE applies a rotation that depends on token position — it can't be absorbed into a static projection matrix. DeepSeek V3 solves this by storing a small set of decoupled RoPE keys ($d_{rope} = 192$) alongside the latent vector. This is the 192-dim component in the table above — a small overhead for position awareness.

Fine-Grained MoE with 256 Experts

Why More Experts

Switch Transformer used 128 experts with top-1 routing. Mixtral uses 8 experts with top-2. DeepSeek V3 uses 256 routed experts with top-8 routing, plus 1 shared expert.

The reasoning is combinatorial: with top-8 selection from 256 experts, each token can activate $\binom{256}{8} \approx 4.1 \times 10^{14}$ unique expert combinations. With top-2 from 8, only $\binom{8}{2} = 28$ combinations exist. More combinations means finer-grained specialization — each expert can focus on a narrower knowledge domain.
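The counts are one `math.comb` call away:

```python
from math import comb

# Routing combinatorics: distinct expert subsets a token can activate
deepseek_combos = comb(256, 8)   # top-8 of 256 routed experts
mixtral_combos  = comb(8, 2)     # top-2 of 8 experts

print(f"{deepseek_combos:.1e}")  # ~4.1e14
print(mixtral_combos)            # 28
```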

Shared Experts

One expert processes every token regardless of routing. This shared expert captures common knowledge (function words, basic syntax, universal patterns) that every token needs, preventing the routed experts from wasting capacity on common patterns.

The Math

Each token’s computation:

$$h' = h + \text{SharedExpert}(h) + \sum_{i \in \text{TopK}(g(h),\, 8)} g_i(h) \cdot \text{Expert}_i(h)$$

where g(h)g(h) is the gating function producing router logits. Total activated parameters per token: 1 shared expert + 8 routed experts = 9 expert FFNs + attention ≈ 37B of the 671B total.
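A minimal sketch of this layer (toy sizes, with linear maps standing in for the expert FFNs — not V3's actual configuration):

```python
import numpy as np

d, n_experts, top_k = 8, 16, 4           # toy sizes, not V3's 256 / top-8
rng = np.random.default_rng(2)
W_router = rng.standard_normal((n_experts, d))
experts  = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_experts)]
shared   = rng.standard_normal((d, d)) * 0.1   # shared expert, always active

def moe_layer(h):
    logits = W_router @ h
    top = np.argsort(logits)[-top_k:]                 # top-k expert indices
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                              # softmax over selected experts
    routed = sum(g * (experts[i] @ h) for g, i in zip(gates, top))
    return h + shared @ h + routed                    # residual + shared + routed

out = moe_layer(rng.standard_normal(d))
print(out.shape)
```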

Auxiliary-Loss-Free Load Balancing

The Problem with Auxiliary Losses

Standard MoE training adds a load-balancing auxiliary loss that penalizes uneven expert utilization. The problem: this loss competes with the language modeling objective, distorting gradients and reducing model quality.

DeepSeek’s Solution: Bias Terms

Instead of a gradient-based loss, DeepSeek V3 adds a per-expert bias term $b_i$ to the router logits:

$$g_i = \text{softmax}(\text{logit}_i + b_i)$$

These biases are not updated through backpropagation. Instead, a simple rule runs after each training step: if expert $i$ is overloaded, decrease $b_i$ by a fixed step; if underloaded, increase it. The bias influences only which experts are selected — the gating weights themselves still come from the original affinity scores — so this control loop operates alongside gradient descent with no interference in the training signal.
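A toy simulation of the control loop (the skew, batch size, and step size `gamma` are illustrative, not V3's values):

```python
import numpy as np

n_experts, top_k, gamma = 16, 4, 0.05    # gamma: fixed bias update step
bias = np.zeros(n_experts)
skew = np.linspace(0.0, 2.0, n_experts)  # router systematically prefers later experts
rng = np.random.default_rng(3)

for step in range(300):
    # Affinities for a batch of 32 tokens; bias shifts only the top-k SELECTION
    affinity = rng.standard_normal((32, n_experts)) + skew
    picks = np.argsort(affinity + bias, axis=1)[:, -top_k:]
    counts = np.bincount(picks.ravel(), minlength=n_experts)
    target = counts.mean()
    # Fixed-step correction, no gradients: push the load toward uniform
    bias += gamma * np.where(counts < target, 1.0, -1.0)

print(counts)  # roughly uniform: ~ 32 * top_k / n_experts = 8 per expert
```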

Why This Matters

Auxiliary losses typically cost a few tenths of a point on benchmarks — small-sounding, but meaningful at frontier scale, where leading models are separated by exactly such margins. DeepSeek reports that the loss-free approach maintains good load balance while outperforming auxiliary-loss baselines on quality.

Multi-Token Prediction (MTP)

Standard LLM training predicts only the next token. DeepSeek V3 adds an extra prediction module that, at each position, also predicts the token after next (the paper uses an MTP depth of 1; the idea generalizes to K steps ahead).

Training Benefits

  1. Richer gradient signal: Each position receives feedback from multiple future tokens, not just the immediate next one
  2. Better representation learning: The model must maintain representations useful for multi-step prediction, leading to more robust features

Inference Benefits: Self-Speculation

The MTP module serves as a built-in draft model for speculative decoding — no separate draft model needed. The main model proposes candidate tokens, then verifies them in one forward pass. The paper reports a second-token acceptance rate of roughly 85–90%, yielding about 1.8x decoding throughput.
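A back-of-envelope for self-speculation, using the standard speculative-decoding expectation (this assumes the verify pass dominates cost and draft acceptances are independent):

```python
# Expected tokens emitted per full forward pass with k draft tokens,
# where draft i is accepted only if all earlier drafts were (prob. alpha each)
def expected_tokens(alpha: float, k: int) -> float:
    # 1 guaranteed token + geometric acceptance of the k drafts
    return sum(alpha**i for i in range(k + 1))

print(expected_tokens(0.85, 1))  # 1.85 -> ~1.8x if drafting is nearly free
```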

FP8 Mixed-Precision Training

DeepSeek V3 trained the entire 671B model in FP8 — the first model at this scale to do so without loss spikes or rollbacks.

Format Selection

  • E4M3 (4 exponent, 3 mantissa bits): more precision, less dynamic range. The standard Transformer Engine recipe uses it for forward-pass GEMMs.
  • E5M2 (5 exponent, 2 mantissa bits): more dynamic range, used for backward-pass gradients in that standard recipe.
  • DeepSeek V3 departs from the hybrid recipe: it adopts E4M3 for all GEMM inputs, relying on fine-grained scaling to contain outliers.
  • FP32/BF16: master weights, optimizer states, LayerNorm, softmax, routing — anything sensitive to numerical precision stays in higher precision.
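The two formats' maximum representable values follow directly from the bit layouts (a quick derivation, no FP8 hardware needed):

```python
# Max representable values from the bit layouts alone
def fp8_max(exp_bits: int, man_bits: int, fn: bool) -> float:
    bias = 2 ** (exp_bits - 1) - 1
    if fn:  # "finite" variants (e4m3fn): top exponent kept for numbers,
            # only the all-ones mantissa is reserved for NaN
        return (2 - 2 * 2 ** -man_bits) * 2.0 ** (2 ** exp_bits - 1 - bias)
    # IEEE-style (e5m2): the top exponent is reserved for inf/NaN
    return (2 - 2 ** -man_bits) * 2.0 ** (2 ** exp_bits - 2 - bias)

e4m3_max = fp8_max(4, 3, fn=True)    # 448.0  (precision over range)
e5m2_max = fp8_max(5, 2, fn=False)   # 57344.0 (range over precision)
print(e4m3_max, e5m2_max)
```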

Fine-Grained Scaling

Every FP8 tensor needs a scale factor that maps its values into the format's narrow range. The standard approach is per-tensor delayed scaling: the scale is computed from previous iterations' amax statistics, avoiding an extra pass over the data:

```python
import torch

# Delayed scaling (baseline recipe): reuse recent amax statistics
amax = max(amax_history)                     # rolling window of recent amax values
scale = min(448.0 / amax, max_scale)         # 448 = E4M3 max representable value
x_fp8 = (x * scale).to(torch.float8_e4m3fn)
amax_history.append(x.abs().max().item())    # record stat for the next iteration
```

DeepSeek V3 goes finer: scales are computed online per 1×128 activation tile and per 128×128 weight block, so a single outlier cannot crush the precision of an entire tensor.

Result

FP8 training achieves ~1.8x throughput over BF16 on H800 GPUs, while maintaining training stability across the full 14.8T token run. The 2.788M GPU-hour budget would have been ~5M GPU-hours in BF16.

DualPipe: Near-Zero Pipeline Bubbles

Standard Pipeline Parallelism

With $P$ pipeline stages and $M$ micro-batches, 1F1B scheduling has a bubble fraction of $(P-1)/(P-1+M)$. For $P=16$ and $M=64$: 19% of GPU time is idle. At DeepSeek's scale, this translates to hundreds of thousands of wasted GPU-hours.
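The bubble formula in a couple of lines:

```python
# 1F1B pipeline bubble fraction: idle time while the pipeline fills and drains
def bubble_fraction(p: int, m: int) -> float:
    return (p - 1) / (p - 1 + m)

print(f"{bubble_fraction(16, 64):.0%}")   # 19%
print(f"{bubble_fraction(8, 32):.0%}")    # 18%
```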

DualPipe

DualPipe sends micro-batches from both ends of the pipeline simultaneously:

  • Forward micro-batches flow left-to-right (standard)
  • Additional micro-batches flow right-to-left simultaneously
  • Forward and backward passes of different micro-batches overlap on each GPU

The result: pipeline bubbles approach zero because each GPU always has work from one direction or the other.

Pipeline Bubble Fraction by Schedule (% idle time)

| Schedule | Configuration | Idle time |
|---|---|---|
| Naive (sequential) | P = 8 | 88% |
| 1F1B (PipeDream) | M = 32 | 19% |
| Interleaved (Megatron) | V = 4 virtual stages | 10% |
| DualPipe (DeepSeek V3) | Bidirectional | 3% |

Training Infrastructure

DeepSeek V3 was trained on 2,048 H800 GPUs across 256 nodes with a four-dimensional parallelism strategy:

DeepSeek V3 Parallelism Configuration

| Dimension | Configuration | Notes |
|---|---|---|
| Expert Parallelism (EP = 64) | 256 routed experts distributed across 64 GPUs | ~4 experts per GPU |
| Tensor Parallelism (TP = 4) | Each attention/FFN layer split across 4 GPUs within a node | NVLink interconnect required |
| Pipeline Parallelism (PP = 4) | Layers split across 4 groups with DualPipe | Cross-node InfiniBand |
| Data Parallelism (DP = 2) | Replicate across 2 independent training groups | Gradient synchronization |

Communication is optimized with DeepEP — a custom all-to-all library for MoE dispatch/combine that exploits asymmetric NVLink (160 GB/s intra-node) and RDMA (50 GB/s inter-node) bandwidth.
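A quick sanity check that the table's degrees compose to the cluster size (treating the four dimensions as a simple product is itself a simplification — in practice EP typically shares ranks with DP):

```python
# Parallelism degrees from the configuration table
ep, tp, pp, dp = 64, 4, 4, 2
n_gpus, gpus_per_node = 2048, 8

assert ep * tp * pp * dp == n_gpus   # 64 * 4 * 4 * 2 = 2,048 GPUs
print(n_gpus // gpus_per_node)       # 256 nodes
print(256 // ep)                     # 256 routed experts / 64 EP ranks = 4 per GPU
```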

What’s Portable from DeepSeek V3

Not every idea requires 2,048 GPUs. Several innovations are immediately applicable:

📊 DeepSeek V3 Ideas: Portability Assessment

| Innovation | Portable? | Minimum Scale | Key Requirement |
|---|---|---|---|
| MLA (latent KV compression) | Yes | Any model size | Architecture change at training time |
| Loss-free load balancing | Yes | Any MoE model | Simple bias-term implementation |
| Multi-token prediction | Yes | Any decoder model | Extra prediction heads during training |
| FP8 training | Moderate | Large models (13B+) | H100/H800 GPUs with Transformer Engine |
| 256 fine-grained experts | No | 671B+ total params | Massive expert-parallelism infrastructure |
| DualPipe | Moderate | Multi-node training | Custom pipeline scheduler implementation |
💡 The Efficiency Lesson

DeepSeek V3 demonstrates that architecture innovation (MLA, fine-grained MoE, loss-free balancing) combined with systems optimization (FP8, DualPipe, DeepEP) can reduce training cost by 5-10x without sacrificing quality. The biggest wins come from reducing KV cache (MLA), eliminating pipeline bubbles (DualPipe), and training in lower precision (FP8) — three orthogonal optimizations that multiply together.