Part 3 of 23 in the Transformer Anatomy series

In Part 2 of this series, we established that tokenizers convert raw text into sequences of integer IDs. The string “The cat sat” might become [464, 3797, 3332] — three integers, nothing more. But a neural network cannot do anything useful with raw integers. It needs continuous, differentiable representations that it can transform through matrix multiplications, nonlinearities, and gradient updates. The embedding layer is the bridge between the discrete world of tokens and the continuous world of neural computation.

This post covers that bridge in full detail: what an embedding layer actually is, how it is initialized, how its dimension is chosen, the weight tying trick that saves billions of parameters, what the geometry of the resulting vector space reveals about learned representations, and the practical memory costs at modern scales.


1. What an Embedding Layer Actually Is

The Lookup Table

At its core, an embedding layer is a lookup table. Nothing more. It is a matrix $E \in \mathbb{R}^{V \times d}$, where $V$ is the vocabulary size and $d$ is the embedding dimension (often called $d_\text{model}$). Each of the $V$ rows is a $d$-dimensional vector that represents one token in the vocabulary.

Given a token ID $i$, the embedding layer returns row $i$ of the matrix:

$$\text{embed}(i) = E[i, :] \in \mathbb{R}^d$$

In PyTorch, this is nn.Embedding:

import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """
    The embedding layer: a learnable lookup table.
    Input:  token IDs   [batch_size, seq_len]       (integers)
    Output: embeddings  [batch_size, seq_len, d_model] (floats)
    """
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        # This is the entire layer: a matrix of shape [V, d_model]
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [batch_size, seq_len] of integers in [0, vocab_size)
        # output:    [batch_size, seq_len, d_model] of floats
        return self.embedding(token_ids)

# Example: Llama 3 scale
embed = TokenEmbedding(vocab_size=128_256, d_model=8192)
token_ids = torch.tensor([[464, 3797, 3332]])  # "The cat sat"
vectors = embed(token_ids)  # Shape: [1, 3, 8192]

The forward pass is not a matrix multiply. It is a table lookup — an indexing operation. The GPU fetches three rows from a 128K-by-8192 matrix. This distinction matters for performance: embedding lookups are memory-bandwidth-bound, not compute-bound.

Why Learned, Not Hand-Crafted

Early NLP systems used hand-crafted features: one-hot encodings, TF-IDF vectors, or manually designed linguistic features. These required domain expertise and never generalized well. The critical insight behind learned embeddings, first popularized by Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), is that the embedding vectors should be parameters of the model, updated by gradient descent during training.

The model discovers its own representations. Nobody tells the model that “cat” and “dog” should be nearby in embedding space — the training objective forces this to emerge. If predicting the next token after “The ___ sat on the mat” works equally well for “cat” and “dog,” then the gradients for those tokens point in the same direction, and their embeddings converge.

This is a profound shift. The representation is no longer a fixed property of the input — it is a learned function of the entire training distribution.

Why Dense Vectors, Not One-Hot

The naive alternative to a learned embedding is a one-hot vector: a $V$-dimensional vector that is all zeros except for a single 1 at the position corresponding to the token ID. For a vocabulary of $V = 128{,}000$ tokens, each one-hot vector has 128,000 dimensions.

One-hot encoding has three fatal problems:

  1. Dimensionality: passing 128K-dimensional vectors through a transformer with hidden size 8192 would require an immediate projection, which is equivalent to… a learned embedding matrix. One-hot followed by a linear layer is mathematically identical to a lookup table.

  2. No generalization: in one-hot space, every token is equidistant from every other token. The cosine similarity between any two distinct one-hot vectors is exactly 0. The model has no inductive bias toward grouping related tokens.

  3. Sparsity and memory: storing and transmitting 128K-dimensional sparse vectors is wasteful. A single dense vector of 8192 floats is roughly 16x smaller, and its density is far better suited to GPU memory access patterns.

The embedding layer compresses $V$-dimensional discrete space into $d$-dimensional continuous space, where $d \ll V$. This compression forces the model to share structure: similar tokens must occupy nearby regions because there are not enough dimensions to give every token its own orthogonal direction.

ℹ️ The Key Insight

An embedding lookup is mathematically identical to multiplying a one-hot vector by the embedding matrix: $\text{onehot}(i)^\top E = E[i, :]$. The lookup table is just an efficient implementation of this sparse matrix multiply. The model learns the embedding matrix $E$ end-to-end via backpropagation.
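This equivalence is easy to verify directly in PyTorch (a toy-sized sketch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, d = 10, 4  # toy vocabulary and embedding dimension
emb = nn.Embedding(V, d)

i = torch.tensor([3])
one_hot = F.one_hot(i, num_classes=V).float()  # [1, V], a single 1 at index 3

# one-hot matmul and direct lookup return the same row of E
assert torch.allclose(one_hot @ emb.weight, emb(i))
```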


2. Initialization Strategies

Why Initialization Matters

Before training begins, the embedding matrix must be filled with initial values. This might seem like a minor detail — after all, training will adjust the values anyway. But poor initialization can cause serious problems:

  • Representation collapse: if all embeddings start too similar, tokens are indistinguishable early in training. The model cannot differentiate between “the” and “quantum,” and gradients for the entire vocabulary point in nearly the same direction. Escaping this regime can take thousands of steps.

  • Exploding activations: if initial embedding norms are too large, they amplify through layer normalization and attention, causing unstable training. Gradient clipping can mask the problem but does not fix it.

  • Slow convergence: even if training eventually converges, bad initialization wastes compute. At the scale of training a 70B parameter model for trillions of tokens, “wasting 5% more steps” translates to millions of dollars in GPU hours.

Random Normal Initialization

The most common initialization is to draw each element of $E$ from a normal distribution:

$$E_{ij} \sim \mathcal{N}(0, \sigma^2)$$

The question is: what should $\sigma$ be?

Llama family: Meta uses $\sigma = 0.02$ for Llama 2 and Llama 3. This is a simple, empirically validated choice. With $d = 8192$, the expected L2 norm of each embedding vector is approximately $\sigma \sqrt{d} = 0.02 \times \sqrt{8192} \approx 1.81$. This keeps activations in a reasonable range for the subsequent RMSNorm layers.

GPT-2: OpenAI uses a scaled initialization $\sigma = 0.02 / \sqrt{2 \cdot n_\text{layers}}$ for the residual projection layers. For GPT-2 Large with 36 layers, that gives $\sigma \approx 0.0024$. The reasoning is that residual connections accumulate contributions from all layers, so each layer’s initial contribution should be smaller when there are more layers.

Xavier / Glorot Initialization

Xavier initialization (Glorot and Bengio, 2010) sets $\sigma$ based on the fan-in and fan-out dimensions:

$$\sigma = \sqrt{\frac{2}{n_\text{in} + n_\text{out}}}$$

For an embedding layer, $n_\text{in} = V$ and $n_\text{out} = d$. Since $V \gg d$, this gives a very small $\sigma$. In practice, most LLM codebases do not use Xavier for the embedding layer specifically, because the embedding lookup is not a true matrix multiplication — the effective fan-in is 1 (a single row is selected), not $V$.

Scaled Initialization

Some modern architectures use scaled initialization that accounts for model depth. The principle is simple: if the residual stream accumulates additive contributions from $L$ layers, and each contribution is independent, then after $L$ layers the variance of the residual grows by a factor of $L$. To compensate:

$$\sigma_\text{residual} = \frac{\sigma_\text{base}}{\sqrt{L}} \quad \text{or} \quad \frac{\sigma_\text{base}}{\sqrt{2L}}$$

The factor of 2 accounts for the two residual additions per transformer block (one from attention, one from the MLP).
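A sketch of how this looks in code, following the GPT-2 convention of shrinking the std for layers that write into the residual stream. The `out_proj` name check is an assumed convention for this sketch, not a universal API:

```python
import math
import torch.nn as nn

def init_transformer_weights(model: nn.Module, n_layers: int, base_std: float = 0.02):
    """Base std for embeddings and most linears; scaled-down std
    for projections that write into the residual stream."""
    residual_std = base_std / math.sqrt(2 * n_layers)
    for name, module in model.named_modules():
        if isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=base_std)
        elif isinstance(module, nn.Linear):
            # assumption: residual-output projections are identified by name
            std = residual_std if name.endswith("out_proj") else base_std
            nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
```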

📊 Embedding Initialization in Major LLMs

| Model | Method | Std Dev (σ) | d_model | Expected Norm |
|---|---|---|---|---|
| GPT-2 (124M) | Normal (scaled) | 0.02 | 768 | 0.55 |
| GPT-2 Large (774M) | Normal (scaled) | 0.02/√72 ≈ 0.0024 | 1280 | 0.09 |
| Llama 2 7B | Normal | 0.02 | 4096 | 1.28 |
| Llama 2 70B | Normal | 0.02 | 8192 | 1.81 |
| Llama 3 8B | Normal | 0.02 | 4096 | 1.28 |
| GPT-3 175B | Normal (scaled) | 0.02/√192 ≈ 0.0014 | 12288 | 0.17 |

Note: Expected norm computed as σ·√d_model. GPT-2/GPT-3 scale σ by 1/√(2·n_layers) for residual layers, but use the base σ = 0.02 for embeddings in some implementations.

3. Embedding Dimension Trade-offs

The Scaling Landscape

The embedding dimension $d_\text{model}$ is one of the most consequential hyperparameters in a transformer. It determines the width of the residual stream — the information highway that carries representations through every layer. All attention heads, MLP layers, and normalization layers operate on vectors of this dimension.

Across the history of transformer LLMs, $d_\text{model}$ has grown dramatically:

Embedding Dimension vs. Model Size

| Model | Year | d_model |
|---|---|---|
| BERT-base (110M) | 2018 | 768 |
| GPT-2 (1.5B) | 2019 | 1,600 |
| Llama 7B | 2023 | 4,096 |
| Llama 13B | 2023 | 5,120 |
| Llama 70B | 2023 | 8,192 |
| GPT-3 175B | 2020 | 12,288 |

The Quality-Cost Trade-off

Increasing $d_\text{model}$ improves model quality, but with diminishing returns. The empirical relationship, observed across multiple scaling law studies (Kaplan et al., 2020; Hoffmann et al., 2022), is that model loss decreases roughly as a power law in the total number of parameters, and $d_\text{model}$ is the primary lever for increasing parameters in both the embedding layer and the transformer blocks.

The key trade-offs are:

Representational capacity: a larger $d_\text{model}$ gives the model more dimensions to encode distinctions between tokens. With $d = 768$, the model has 768 independent axes to separate 30,000+ tokens. With $d = 8192$, it has over 10x more axes. This matters especially for nuanced semantic distinctions — different senses of polysemous words, fine-grained entity types, and compositional meaning.

Memory cost of the embedding table: the embedding matrix has $V \times d_\text{model}$ parameters. For the Llama 3 configuration ($V = 128{,}256$, $d = 8192$), that is:

$$128{,}256 \times 8{,}192 = 1{,}050{,}673{,}152 \text{ parameters}$$

At 2 bytes per parameter in BF16:

$$1{,}050{,}673{,}152 \times 2 \text{ bytes} \approx 2.1 \text{ GB}$$

This is a fixed cost — it does not depend on batch size or sequence length.

Compute cost: every layer performs matrix multiplications involving $d_\text{model}$-dimensional vectors. Attention computes $QK^\top$ and $\alpha V$ products; the MLP computes two dense projections through a hidden size of $4 d_\text{model}$ (or $\tfrac{8}{3} d_\text{model}$ with SwiGLU). FLOPs scale as $O(d_\text{model}^2)$ per token per layer for the MLP, and $O(d_\text{model} \cdot s)$ per token per layer for attention (where $s$ is the sequence length). Doubling $d_\text{model}$ roughly quadruples MLP compute per layer.

Bandwidth cost: during inference, the model weights for each layer must be read from GPU HBM for every token generated. Larger dmodeld_\text{model} means larger weight matrices, which means more bytes transferred per token. On memory-bandwidth-bound workloads (which is most of autoregressive decoding), this directly increases latency.
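The quadratic MLP scaling is easy to check with a back-of-the-envelope helper (a simplified count that ignores attention, normalization, and biases):

```python
def mlp_flops_per_token(d_model: int, expansion: int = 4) -> int:
    # two dense projections (d -> expansion*d -> d), 2 FLOPs per multiply-add
    up = 2 * d_model * (expansion * d_model)
    down = 2 * (expansion * d_model) * d_model
    return up + down

# doubling d_model quadruples MLP compute
print(mlp_flops_per_token(8192) / mlp_flops_per_token(4096))  # 4.0
```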

The Practical Sweet Spot

The Chinchilla scaling laws (Hoffmann et al., 2022) established that model size and training data should be scaled together. In practice, the community has converged on a set of typical configurations:

📊 Embedding Dimension Sweet Spots

| Model Scale | d_model | Vocab Size | Embedding Table (BF16) | Fraction of Total Params |
|---|---|---|---|---|
| 1-3B | 2048-3072 | 32K | 128-192 MB | ~15-20% |
| 7-13B | 4096-5120 | 32K-128K | 0.5-1.3 GB | ~8-15% |
| 30-70B | 6656-8192 | 32K-128K | 0.9-2.1 GB | ~2-5% |
| 175B+ | 12288+ | 100K+ | 2.5+ GB | ~1-2% |

Note: Embedding table fraction decreases with model size because transformer layers grow quadratically with d_model while embeddings grow linearly.

An important observation: the embedding table is a shrinking fraction of the total model as models get larger. For BERT-base, the embedding table is roughly 23% of total parameters. For Llama 70B, it is roughly 1.5%. This is because the transformer layers contain $O(L \cdot d_\text{model}^2)$ parameters (from the attention and MLP weight matrices), which grow much faster than the $O(V \cdot d_\text{model})$ embedding table.

Memory Rule of Thumb

Embedding table memory in BF16: $V \times d_\text{model} \times 2$ bytes. For Llama 3 ($V = 128{,}256$, $d = 8192$): approximately 2 GB. This is a fixed overhead — it does not scale with batch size or sequence length, unlike the KV cache.


4. Weight Tying: One Matrix, Two Jobs

The Idea

A transformer LLM has two matrices that relate tokens to vectors. At the input side, the embedding matrix $E \in \mathbb{R}^{V \times d}$ maps token IDs to vectors. At the output side, the language model head (sometimes called the “unembedding” matrix) $W_\text{out} \in \mathbb{R}^{d \times V}$ maps the final hidden state back to logits over the vocabulary:

$$\text{logits} = h_\text{final} \cdot W_\text{out} \in \mathbb{R}^V$$

Without weight tying, $E$ and $W_\text{out}$ are separate parameters — the model has $2 \times V \times d$ parameters just for token-to-vector and vector-to-token conversion. With weight tying, we set $W_\text{out} = E^\top$ (or equivalently, share the same underlying parameter matrix):

$$\text{logits} = h_\text{final} \cdot E^\top$$

This single change halves the parameter count of the embedding and output layers.

The Intuition

Weight tying rests on a simple and elegant intuition. The input embedding maps tokens to vectors based on their meaning: semantically similar tokens should have similar embedding vectors. The output projection maps hidden states to probability distributions over tokens: similar hidden states should assign similar probabilities to similar tokens.

These two desiderata are the same constraint expressed from opposite directions. If “cat” and “kitten” have similar embedding vectors $E[\text{cat}] \approx E[\text{kitten}]$, then for any hidden state $h$, the dot products $h \cdot E[\text{cat}]$ and $h \cdot E[\text{kitten}]$ will be similar — meaning the model assigns them similar output probabilities. This is exactly what we want.

The formal justification comes from Press and Wolf (2017), who showed that weight tying can be understood as imposing a symmetry constraint: the model’s notion of “what tokens mean” (input embedding) should be consistent with its notion of “what tokens to predict” (output projection).

Memory Savings

The savings are significant at modern scales:

def embedding_memory_savings(vocab_size: int, d_model: int, dtype_bytes: int = 2):
    """
    Calculate memory saved by weight tying.
    dtype_bytes=2 for BF16/FP16, =4 for FP32
    """
    params = vocab_size * d_model
    memory_bytes = params * dtype_bytes
    memory_gb = memory_bytes / (1024 ** 3)
    return {
        "params_saved": f"{params:,}",
        "memory_saved_gb": f"{memory_gb:.2f} GB",
    }

# Llama 3 8B:  V=128,256  d=4096
print(embedding_memory_savings(128_256, 4096))
# {'params_saved': '525,336,576', 'memory_saved_gb': '0.98 GB'}

# Llama 3 70B: V=128,256  d=8192
print(embedding_memory_savings(128_256, 8192))
# {'params_saved': '1,050,673,152', 'memory_saved_gb': '1.96 GB'}

# GPT-3 175B:  V=50,257   d=12288
print(embedding_memory_savings(50_257, 12288))
# {'params_saved': '617,558,016', 'memory_saved_gb': '1.15 GB'}

Memory Saved by Weight Tying (BF16)

| Model | Vocab Size | d_model | Memory Saved |
|---|---|---|---|
| BERT-base | 30,522 | 768 | 0.04 GB |
| GPT-2 | 50,257 | 1,600 | 0.15 GB |
| Llama 3 8B | 128,256 | 4,096 | 0.98 GB |
| Llama 3 70B | 128,256 | 8,192 | 1.96 GB |
| Hypothetical | 256,000 | 8,192 | 3.91 GB |

At Llama 3 70B scale, weight tying would save nearly 2 GB of memory. That 2 GB is not just disk space — it is 2 GB of GPU HBM that can instead hold KV cache for longer contexts, larger batches, or both.

Who Ties, Who Doesn’t

The decision is not universal:

Models that tie weights: GPT-2, GPT-Neo, BERT, T5, Gemma, and many smaller open-weight models (for example, Llama 3.2 1B and 3B). For these models, the lm_head.weight tensor is a reference to embed_tokens.weight — the same memory, the same gradients.

Models that do not tie: larger models, including Llama 2 and the 8B-and-up Llama 3 variants, deliberately keep the matrices separate. The argument against tying is that the input and output tasks are subtly different. The input embedding is optimized to produce useful representations for all downstream layers. The output projection is optimized to separate tokens in logit space for next-token prediction. These objectives may conflict, especially for tokens that are semantically similar but syntactically distinct (e.g., “run” as a noun vs. “run” as a verb).

Quality impact: experiments going back to Press and Wolf (2017) show that weight tying is generally neutral or mildly positive for model quality. The regularization effect of the shared constraint often helps more than the capacity reduction hurts. The trade-off matters most below roughly 1B parameters, where the embedding table is a large fraction of total parameters and tying saves proportionally the most.

💡 Implementation Detail

In PyTorch with weight tying, only one copy of the matrix exists in memory. During backward passes, gradients from both the embedding loss and the output loss are accumulated into the same parameter tensor. This means the embedding vectors receive gradient signal from two sources: the forward embedding path and the backward unembedding path.

The Code

Implementing weight tying is straightforward:

class TransformerLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, n_layers: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList([
            TransformerBlock(d_model) for _ in range(n_layers)
        ])
        self.norm = nn.RMSNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

        # Weight tying: share the embedding matrix
        self.lm_head.weight = self.embed_tokens.weight
        # Now lm_head.weight and embed_tokens.weight
        # point to the SAME tensor in memory.

    def forward(self, token_ids):
        # Input embedding: lookup
        x = self.embed_tokens(token_ids)  # [B, S, D]

        # Transformer layers
        for layer in self.layers:
            x = layer(x)
        x = self.norm(x)

        # Output projection: uses the SAME matrix as embedding
        logits = self.lm_head(x)  # [B, S, V]
        return logits

Note that nn.Embedding stores a matrix of shape [V, d_model], while nn.Linear(d_model, V, bias=False) stores a matrix of shape [V, d_model] (PyTorch stores the transposed form). This is why the tying works directly — both expect the same shape.
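This shape agreement — and the fact that tying really is one tensor, not a copy — can be checked in a few lines (toy sizes):

```python
import torch.nn as nn

emb = nn.Embedding(100, 8)            # stores [V, d_model] = [100, 8]
head = nn.Linear(8, 100, bias=False)  # also stores [100, 8] (transposed form)
assert emb.weight.shape == head.weight.shape == (100, 8)

# after tying, both attribute names refer to the SAME parameter tensor
head.weight = emb.weight
assert head.weight is emb.weight
```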


5. The Geometry of Embedding Space

Embeddings as Points in Space

Each token’s embedding vector is a point in dd-dimensional space. After training, the positions of these points encode the relationships the model has learned. Tokens that appear in similar contexts — and therefore have similar functional roles in language — end up in nearby regions of this space.

The standard measure of similarity in embedding space is cosine similarity:

$$\cos(E[i], E[j]) = \frac{E[i] \cdot E[j]}{\|E[i]\| \, \|E[j]\|}$$

Cosine similarity ranges from $-1$ (opposite directions) through $0$ (orthogonal) to $+1$ (same direction). It ignores vector magnitude and measures only directional alignment, which makes it more stable than the raw dot product for comparing embeddings with different norms.
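For instance, with toy stand-ins for embedding rows (illustrative values, not real embeddings):

```python
import torch
import torch.nn.functional as F

a = torch.tensor([1.0, 2.0, 0.0])
b = torch.tensor([2.0, 4.0, 0.0])    # same direction as a, twice the norm
c = torch.tensor([-1.0, -2.0, 0.0])  # opposite direction

print(F.cosine_similarity(a, b, dim=0))  # ~1.0: magnitude is ignored
print(F.cosine_similarity(a, c, dim=0))  # ~-1.0: opposite directions
```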

Linear Substructure and Analogies

The most famous property of embedding spaces is their linear substructure. The Word2Vec paper (Mikolov et al., 2013) demonstrated that semantic relationships are often encoded as consistent vector offsets:

$$E[\text{king}] - E[\text{man}] + E[\text{woman}] \approx E[\text{queen}]$$

This works because the vector $E[\text{king}] - E[\text{man}]$ captures the concept of “royalty” independent of gender, and adding $E[\text{woman}]$ applies that concept along the gender axis.

Similar linear relationships emerge for other semantic dimensions:

  • Tense: $E[\text{walked}] - E[\text{walk}] \approx E[\text{swam}] - E[\text{swim}]$
  • Plurality: $E[\text{cats}] - E[\text{cat}] \approx E[\text{dogs}] - E[\text{dog}]$
  • Geography: $E[\text{Paris}] - E[\text{France}] \approx E[\text{Berlin}] - E[\text{Germany}]$

Why does this happen? The model learns that these relationships are predictively useful. If knowing the tense transformation helps predict the next token in one context, the model encodes that transformation as a consistent direction in embedding space, because a consistent direction can be exploited by a single linear operation in the attention or MLP layers.

ℹ️ Linearity Is Not Guaranteed

The analogy property is strongest in static word embeddings (Word2Vec, GloVe) and in the embedding layer of transformers. In later layers, representations become increasingly contextualized and nonlinear. The embedding layer captures distributional similarity; later layers capture contextual meaning.

Isotropy and the Representation Degeneration Problem

In an ideal embedding space, vectors would be distributed roughly uniformly across all directions — this property is called isotropy. An isotropic space uses its full dimensionality efficiently: every direction carries information.

In practice, trained embedding spaces are highly anisotropic. Ethayarajh (2019) showed that:

  1. Embeddings cluster in a narrow cone: in BERT, GPT-2, and ELMo, the average cosine similarity between random token pairs is far above zero — often 0.5 or higher. This means most embeddings point in roughly the same direction, and the space is not being used efficiently.

  2. Later layers are worse: as representations pass through more transformer layers, they become even more anisotropic. The final layer’s representations are almost all in the same narrow region of the space.

  3. Frequency drives the effect: high-frequency tokens (articles, prepositions, punctuation) dominate the principal components of the embedding matrix, skewing the entire space toward a small number of directions.

This is the representation degeneration problem. The model has $d$ dimensions available but effectively uses far fewer. One cause is the softmax bottleneck in the output layer: the model must assign high probability to the correct next token, which pushes the output layer’s weight vectors (and hence, with weight tying, the embeddings) toward configurations that maximize the dynamic range of logits, at the cost of uniform coverage of the space.
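The narrow-cone effect is easy to simulate. The sketch below fakes anisotropy by adding a shared dominant component to random vectors — an illustration of the measurement, not real model embeddings:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, d = 1000, 64

iso = torch.randn(V, d)                              # isotropic baseline
aniso = torch.randn(V, d) + 5.0 * torch.randn(1, d)  # shared dominant direction

def mean_pairwise_cosine(E: torch.Tensor) -> float:
    n = E.shape[0]
    En = F.normalize(E, dim=1)
    sims = En @ En.T
    # average over all pairs, excluding self-similarity on the diagonal
    return ((sims.sum() - n) / (n * (n - 1))).item()

print(f"isotropic:   {mean_pairwise_cosine(iso):.3f}")    # near 0
print(f"anisotropic: {mean_pairwise_cosine(aniso):.3f}")  # far above 0: a narrow cone
```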

Why Later-Layer Representations Are More Useful

A common misconception is that the embedding layer produces the “best” representations of tokens. In fact, the opposite is true. The embedding layer provides only distributional information — tokens that appear in similar contexts get similar vectors, regardless of their specific meaning in the current sentence.

Consider the word “bank” in:

  • “I deposited money at the bank”
  • “I sat on the river bank”

The embedding layer produces the same vector for “bank” in both sentences — it is a context-free representation. It is only after passing through multiple transformer layers, where attention allows “bank” to attend to “money” or “river,” that the representation becomes context-dependent and disambiguated.

For downstream tasks like classification, search, or retrieval, representations from the middle or final layers of the transformer consistently outperform raw embedding layer representations. The embedding layer is the starting point — not the destination.


6. Token Frequency Effects

The Frequency-Norm Relationship

Not all tokens are created equal in embedding space. There is a strong, consistent relationship between a token’s frequency in the training data and the properties of its learned embedding vector.

High-frequency tokens (like “the,” “is,” “of,” “and”) receive far more gradient updates during training. They appear in nearly every batch, and their embeddings are refined by millions of gradient steps. As a result:

  • Their embeddings have larger L2 norms. The model has high confidence about where these tokens belong in the space, and their vectors move far from the origin.
  • They cluster tightly in a relatively small region of the space. Since these tokens appear in extremely diverse contexts, their embeddings converge to represent a kind of “average” that is broadly compatible with many contexts.
  • They dominate the principal components of the embedding matrix. If you compute the singular value decomposition (SVD) of the embedding matrix, the top singular vectors are strongly aligned with high-frequency token embeddings.

Low-frequency tokens (technical terms, rare names, unusual words) receive far fewer gradient updates:

  • Their embeddings have smaller L2 norms, reflecting less cumulative gradient magnitude.
  • They are scattered more widely in the space, often in less-populated regions. The model has less information about where they should be, so they drift.
  • They are less semantically precise. With fewer gradient updates, the model has had fewer opportunities to learn fine-grained distinctions between rare tokens.

📊 Token Frequency Effects on Embedding Properties

| Token Category | Frequency Rank | Avg L2 Norm | Avg Cosine to Centroid | Gradient Updates (relative) |
|---|---|---|---|---|
| Articles (the, a, an) | Top 10 | High (8-12) | 0.85+ | 1.0x (baseline) |
| Common verbs (is, was, have) | Top 100 | High (6-10) | 0.75-0.85 | ~0.3x |
| Domain words (algorithm, tensor) | Top 10K | Medium (3-6) | 0.50-0.70 | ~0.01x |
| Rare tokens (xylophone, zygote) | 50K+ | Low (1-3) | 0.20-0.45 | ~0.0001x |
| Byte fallback tokens | Rarely used | Very low (0.5-1.5) | Near random | ~0.00001x |

Note: Values are illustrative based on empirical studies of GPT-2 and Llama 2 embeddings. Exact values vary by model and training data.
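The “more updates → larger norm” effect can be illustrated with a toy random-walk simulation. Real training gradients are not random noise, but the cumulative-magnitude intuition is the same — this is purely illustrative:

```python
import torch

torch.manual_seed(0)
d = 64

# two embedding rows initialized like an LLM embedding (sigma = 0.02)
frequent = torch.randn(d) * 0.02
rare = frequent.clone()

step_size = 1e-3
for _ in range(1000):  # frequent token: many noisy updates
    frequent += step_size * torch.randn(d)
for _ in range(10):    # rare token: few updates
    rare += step_size * torch.randn(d)

# the heavily-updated row drifts much further from the origin
print(frequent.norm().item(), rare.norm().item())
```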

Implications for Model Behavior

This frequency-norm structure has direct consequences for model behavior:

Rare tokens are harder to use correctly. When the model encounters a rare token, its embedding is a weak, low-norm signal that provides less useful information to the attention mechanism. The model must rely more on surrounding context to disambiguate meaning, which is why LLMs sometimes produce incorrect or generic outputs for unusual words.

Byte-level fallback tokens are nearly meaningless. Models with byte-pair encoding (BPE) tokenizers have fallback byte tokens (individual UTF-8 bytes) for characters not covered by the learned vocabulary. These tokens are used extremely rarely — only for unusual Unicode characters or corrupted text. Their embeddings receive almost no training signal and are essentially random directions in embedding space. This is one reason why LLMs struggle with unusual character encodings and non-Latin scripts.

Subword tokens inherit partial meaning. A word like “unbelievable” might be tokenized as [“un”, “believ”, “able”]. Each subword token has its own embedding vector. The prefix “un” has a strong embedding (it appears frequently as a negation prefix), but “believ” is less common as a standalone subword. The model must compose the full word’s meaning across multiple token embeddings through attention, which is why compositional understanding of morphology is imperfect in current LLMs.

⚠️ Rare Token Degradation

If a token appears fewer than ~100 times in the training corpus, its embedding vector may be essentially random. This affects model performance on technical jargon, proper nouns from underrepresented domains, and code identifiers. Expanding the vocabulary does not always help — adding more rare tokens creates more underfit embeddings.


7. Positional Information in Embeddings

The Position Problem

The embedding layer maps each token ID to a fixed vector, regardless of where that token appears in the sequence. But position matters enormously in language — “dog bites man” and “man bites dog” contain the same tokens in different positions with completely opposite meanings. The transformer’s self-attention mechanism is permutation-invariant (as we discussed in Part 1), so without explicit positional information, the model treats the input as a bag of tokens.

Some mechanism must inject position information. Historically, this was done by adding a separate positional embedding to each token embedding.

Learned Absolute Position Embeddings (GPT-2 Style)

The simplest approach maintains a second embedding table $E_\text{pos} \in \mathbb{R}^{L_\text{max} \times d}$, where $L_\text{max}$ is the maximum sequence length the model can handle. At position $p$, the input to the first transformer layer is the sum of the token embedding and the position embedding:

$$h_p^{(0)} = E_\text{tok}[\text{token}_p] + E_\text{pos}[p]$$

import torch
import torch.nn as nn

class GPT2Embedding(nn.Module):
    """GPT-2 style: token embedding + learned absolute position embedding."""
    def __init__(self, vocab_size: int, max_len: int, d_model: int):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)  # Separate table

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        B, S = token_ids.shape
        positions = torch.arange(S, device=token_ids.device)  # [0, 1, ..., S-1]
        tok_emb = self.tok_embed(token_ids)   # [B, S, D]
        pos_emb = self.pos_embed(positions)   # [S, D] -> broadcast to [B, S, D]
        return tok_emb + pos_emb

This adds $L_\text{max} \times d$ parameters. For GPT-2 ($L_\text{max} = 1024$, $d = 1600$), that is 1.6M extra parameters — negligible. For a hypothetical model with $L_\text{max} = 128{,}000$ and $d = 8192$, that is over 1 billion parameters — significant.
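The arithmetic behind those two figures, as a quick check:

```python
# Learned absolute position embeddings cost L_max * d parameters.
gpt2_params = 1024 * 1600            # GPT-2 scale
long_ctx_params = 128_000 * 8192     # hypothetical long-context model

assert gpt2_params == 1_638_400           # ~1.6M parameters
assert long_ctx_params == 1_048_576_000   # ~1.05B parameters
```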

The Extrapolation Problem

Learned absolute position embeddings have a critical flaw: the model has never seen a gradient for positions beyond $L_\text{max}$. The embedding table simply has no row for position $L_\text{max} + 1$. If you try to extend the context length after training, the model encounters a position embedding it has never learned, and performance degrades catastrophically.

This is not a theoretical concern. GPT-2 was limited to 1024 tokens. Early GPT-3 was limited to 2048 tokens, later extended to 4096. Each extension required careful fine-tuning.
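The failure is concrete, not gradual: the position table simply has no row to return. A minimal sketch of what happens when a position index exceeds the table size (sizes are illustrative):

```python
import torch
import torch.nn as nn

max_len = 1024
pos_embed = nn.Embedding(max_len, 16)

# Positions 0..1023 have learned rows; position 1024 does not.
ok = pos_embed(torch.tensor([1023]))  # works: [1, 16]

try:
    pos_embed(torch.tensor([1024]))   # one past the end of the table
    crashed = False
except IndexError:
    crashed = True

assert crashed  # the lookup has nowhere to go
```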

The Transition to RoPE and ALiBi

Modern decoder-only models have largely abandoned learned absolute position embeddings in favor of methods that encode relative position directly in the attention computation:

  • RoPE (Rotary Position Embeddings): used by Llama 2, Llama 3, Mistral, and most modern open-weight models. RoPE encodes position by rotating the query and key vectors before the dot product, so the attention score between positions $i$ and $j$ depends only on $i - j$. This enables much better length extrapolation.

  • ALiBi (Attention with Linear Biases): used by BLOOM and MPT. ALiBi adds a position-dependent bias directly to attention scores, with no learned parameters. This allows zero-cost length extrapolation.

Both approaches are applied within the attention mechanism, not at the embedding layer. This means the embedding layer in modern models like Llama 3 contains only token embeddings — no positional embeddings at all. The position signal is injected later, in each attention layer.
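To preview the relative-position property RoPE relies on, here is a toy sketch with a single 2-D frequency pair (the full mechanism, covered in Part 4, applies many such rotations at different frequencies). Rotating queries and keys by position-proportional angles makes their dot product depend only on the offset between positions:

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector by the given angle (one RoPE frequency pair)."""
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

theta = 0.1                       # per-position rotation angle
q, k = (1.0, 2.0), (0.5, -1.0)    # arbitrary query/key pair

def score(i, j):
    """Attention score between a query at position i and a key at position j."""
    return dot(rotate(q, i * theta), rotate(k, j * theta))

# Shifting both positions by the same offset leaves the score unchanged:
# it depends only on the relative distance i - j.
assert abs(score(5, 2) - score(105, 102)) < 1e-9
```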

We cover RoPE and ALiBi in depth in Part 4 of this series.

Segment and Type Embeddings

BERT introduced an additional embedding: the segment embedding (also called type embedding). BERT processes pairs of sentences for tasks like natural language inference, and the segment embedding tells the model which sentence each token belongs to:

$$h_p^{(0)} = E_\text{tok}[\text{token}_p] + E_\text{pos}[p] + E_\text{seg}[\text{segment}_p]$$

The segment embedding table is tiny — just 2 rows (sentence A and sentence B) of dimension $d$. This idea has been largely abandoned in decoder-only models, which process a single continuous sequence and have no need for segment boundaries. Instruction-following models (like ChatGPT or Llama-Chat) use special tokens ([INST], <|im_start|>) to demarcate roles instead.
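The three-way sum can be sketched as follows. This is an illustrative module with made-up dimensions, not BERT's actual implementation:

```python
import torch
import torch.nn as nn

class BertStyleEmbedding(nn.Module):
    """BERT style: token + learned position + segment (type) embeddings."""
    def __init__(self, vocab_size: int, max_len: int, d_model: int,
                 num_segments: int = 2):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        self.seg_embed = nn.Embedding(num_segments, d_model)  # just 2 rows

    def forward(self, token_ids, segment_ids):
        S = token_ids.shape[1]
        positions = torch.arange(S, device=token_ids.device)
        return (self.tok_embed(token_ids)
                + self.pos_embed(positions)      # broadcast over batch
                + self.seg_embed(segment_ids))

emb = BertStyleEmbedding(vocab_size=1000, max_len=128, d_model=32)
ids = torch.randint(0, 1000, (2, 10))
segs = torch.tensor([[0] * 6 + [1] * 4] * 2)  # sentence A, then sentence B
out = emb(ids, segs)
assert out.shape == (2, 10, 32)
```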

ℹ️ Modern Embedding Layers Are Simpler

In GPT-2 and BERT, the embedding layer combines token + position (+ segment) embeddings by addition. In Llama 2, Llama 3, Mistral, and most modern decoder-only models, the embedding layer is just the token embedding lookup. Positional information is handled entirely by RoPE within the attention layers. This simplification is one of the underappreciated differences between old and new architectures.


8. Input/Output Shapes and Memory

The Complete Data Flow

Let us trace the exact shapes and memory costs of the embedding layer in a concrete production setting. Consider Llama 3 8B with a batch of 8 sequences, each 4096 tokens long:

Step 1: Token IDs

The tokenizer produces integer IDs:

$$\text{token\_ids}: [B, S] = [8, 4096]$$

These are 32-bit integers (4 bytes each):

$$8 \times 4{,}096 \times 4 = 131{,}072 \text{ bytes} = 128 \text{ KB}$$

Step 2: Embedding Lookup

The embedding layer indexes into the table and produces dense vectors:

$$\text{embeddings}: [B, S, D] = [8, 4096, 4096]$$

In BF16 (2 bytes per element):

$$8 \times 4{,}096 \times 4{,}096 \times 2 = 268{,}435{,}456 \text{ bytes} = 256 \text{ MB}$$

Step 3: The Embedding Table Itself

The table is a fixed cost, independent of batch size and sequence length:

$$E: [V, D] = [128{,}256, 4096]$$

In BF16:

$$128{,}256 \times 4{,}096 \times 2 = 1{,}050{,}673{,}152 \text{ bytes} \approx 1.0 \text{ GB}$$
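All three figures can be reproduced with a few lines of arithmetic:

```python
# Llama 3 8B embedding-stage memory: BF16 tensors, int32 token IDs
B, S, D, V = 8, 4096, 4096, 128_256

token_ids_bytes = B * S * 4        # int32: 4 bytes per ID
embeddings_bytes = B * S * D * 2   # BF16: 2 bytes per element
table_bytes = V * D * 2            # BF16 embedding table

assert token_ids_bytes == 131_072          # 128 KB
assert embeddings_bytes == 268_435_456     # 256 MB
assert table_bytes == 1_050_673_152        # ~1.0 GB
```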

Embedding Layer Memory Breakdown (Llama 3 8B)

Batch=8, Sequence=4096, V=128,256, d_model=4096, BF16

| Component | Size | Detail |
| --- | --- | --- |
| Embedding Table (V x D) | 1.0 GB | 128,256 x 4,096 x 2 bytes. Fixed cost, independent of batch/seq. |
| Batch Embeddings (B x S x D) | 256 MB | 8 x 4,096 x 4,096 x 2 bytes. Scales with batch and sequence. |
| Token IDs (B x S) | 128 KB | 8 x 4,096 x 4 bytes. Negligible. |

Scaling to Llama 3 70B

Now consider the larger model:

📊 Embedding Memory at Different Scales

| Component | Llama 3 8B (d=4096) | Llama 3 70B (d=8192) | Scaling Factor |
| --- | --- | --- | --- |
| Embedding Table (V x D) | 1.0 GB | 2.0 GB | 2x (linear in d) |
| Batch Embeddings (B=8, S=4096) | 256 MB | 512 MB | 2x (linear in d) |
| Batch Embeddings (B=32, S=4096) | 1.0 GB | 2.0 GB | 2x (linear in d) |
| Batch Embeddings (B=8, S=128K) | 8.0 GB | 16.0 GB | 2x (linear in d) |
| Token IDs | 128 KB | 128 KB | 1x (independent of d) |

Note: With weight tying, the embedding table and output projection share memory, so these costs are not doubled. Without tying, double the embedding table row.

At long contexts, the batch embeddings dominate. A single batch of 8 sequences at 128K context with $d = 8192$ requires 16 GB just for the embedding output — before a single attention layer has executed. In practice, this memory is reused: the embedding output is consumed by the first layer and can be freed before the second layer runs. But it sets the floor for peak memory usage at the embedding stage.
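The 16 GB floor follows from the same formula:

```python
# Batch embeddings at long context: B * S * d * 2 bytes (BF16)
B, S, d = 8, 128 * 1024, 8192
embedding_output_bytes = B * S * d * 2

assert embedding_output_bytes == 17_179_869_184  # 16 GiB
```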

The Full Picture: Where Embedding Memory Fits

To put embedding costs in perspective, here is how they compare to other memory consumers during inference:

GPU Memory During Inference (Llama 3 70B on 2x H100)

Batch=8, Sequence=4096, BF16, with weight tying

| Component | Size | Detail |
| --- | --- | --- |
| Model Weights (all layers) | ~136 GB | Attention + MLP + norms across 80 layers |
| Embedding Table (tied) | 2.0 GB | Shared between input and output; V=128K, d=8192 |
| KV Cache | ~32 GB | 80 layers x 8 KV heads x 128 dim x 4096 seq x 8 batch x 2 bytes |
| Activations + Workspace | ~8 GB | Intermediate tensors, optimizer states if training |
| Framework Overhead | ~4 GB | CUDA context, NCCL, PyTorch allocator |

The embedding table is roughly 2 GB out of ~182 GB total — about 1.1% of the memory budget. The KV cache is 16x larger. This is why most inference optimization efforts focus on the KV cache (GQA, quantized KV, paged attention) rather than the embedding layer. But for resource-constrained deployments — edge devices, mobile, or very large vocabularies — the embedding table is a meaningful target for compression via quantization or vocabulary pruning.


Putting It All Together

Let us trace the complete journey of a token through the embedding layer, from integer ID to dense vector ready for the first transformer layer:

  1. Tokenization (covered in Part 2): raw text is split into tokens and mapped to integer IDs. “The cat sat” becomes [464, 3797, 3332].

  2. Embedding lookup: each integer ID indexes into the embedding table $E \in \mathbb{R}^{V \times d}$, producing a dense vector. The output shape is [batch, seq_len, d_model].

  3. Position encoding (modern models): in GPT-2, a learned positional embedding is added to the token embedding. In Llama 3, no positional embedding is added here — RoPE is applied later inside each attention layer.

  4. Normalization (some models): some architectures apply an initial layer normalization or RMSNorm to the embedding output before the first transformer layer. Llama 3 does not; the raw embedding vectors enter the first layer directly.

The result is a tensor of continuous vectors, one per token, that carries both the identity of each token (from the learned embedding) and, in older architectures, its position. This tensor is the input to the transformer stack — the starting point for all the attention and computation that follows.

💡 What Comes Next

In Part 4, we examine positional encoding in depth — how RoPE encodes position through complex-valued rotations, how ALiBi adds linear bias without any parameters, and how modern context extension techniques allow models trained on 4K tokens to generalize to 128K and beyond.


Summary

The embedding layer is deceptively simple — a matrix lookup — but it encodes a set of deep design decisions that affect every aspect of model performance:

  • The lookup table maps discrete token IDs to continuous vectors, enabling gradient-based learning. It is mathematically equivalent to a one-hot encoding followed by a linear projection, but implemented as an efficient index operation.

  • Initialization (typically $\mathcal{N}(0, 0.02^2)$) sets the starting point for learning. Poor initialization causes representation collapse or exploding activations. Scaled variants like $\sigma / \sqrt{2L}$ account for depth.

  • Embedding dimension ($d_\text{model}$) is the fundamental width of the model. It trades off representational capacity against memory and compute cost, with empirical sweet spots at $d = 4096$ for 7-13B models and $d = 8192$ for 70B+ models.

  • Weight tying shares the embedding matrix between input and output, saving up to 2 GB at modern scales with no quality loss. Most production LLMs use it.

  • Embedding geometry reveals what the model has learned: linear substructure encodes semantic analogies, but anisotropy and frequency effects mean the space is used inefficiently.

  • Token frequency creates a hierarchy: common tokens have large, precise embeddings; rare tokens have small, imprecise embeddings. This directly impacts model reliability on unusual inputs.

  • Positional embeddings were historically added at the embedding layer (GPT-2) but are now handled separately via RoPE or ALiBi in the attention layers (Llama 3, Mistral), simplifying the embedding layer to a pure token lookup.

  • Memory costs are dominated by the embedding table ($V \times d \times 2$ bytes), which is a fixed overhead. At modern scales, the embedding table is 1-2% of total model memory — significant for edge deployment, negligible relative to the KV cache.

The embedding layer transforms integers into geometry. Everything the transformer learns — attention patterns, factual knowledge, reasoning ability — is built on this geometric foundation.