In Part 1 of this series, we dissected the attention mechanism — how queries match against keys, how softmax produces a probability distribution over positions, how the cost in sequence length dominates everything at long contexts. But we skipped a question that logically comes first: how does raw text become the sequence of token embeddings that attention operates on?
The answer is tokenization, and it is far less innocent than it looks. The tokenizer is the very first thing that touches your input and the very last thing that produces your output. Every decision it makes — how it splits words, how large its vocabulary is, whether it has seen a particular script before — silently propagates through the entire model. Tokenization determines sequence length, which determines attention cost. Tokenization determines vocabulary size, which determines embedding table memory. Tokenization determines how the model “sees” arithmetic, code, multilingual text, and rare words.
This post builds tokenization from first principles. We will start with why the obvious approaches (characters and words) fail, derive Byte Pair Encoding from scratch with a worked example, compare the three dominant tokenizer implementations, analyze the vocabulary size tradeoff with real numbers, catalog the failure modes that trip up practitioners, and connect tokenization directly to inference cost.
Part 1: Why Not Characters? Why Not Words?
Before subword tokenization existed, there were two obvious strategies: tokenize at the character level or tokenize at the word level. Both fail, but understanding exactly how they fail reveals why subwords are the right answer.
Character-Level Tokenization
The simplest possible tokenizer maps each character to a token. The English alphabet gives you roughly 26 lowercase + 26 uppercase + 10 digits + punctuation — about 100 tokens in total. For Unicode, you might need a few thousand entries to cover the common scripts, but the vocabulary stays small and finite.
The appeal is obvious: every possible string can be represented. There are no unknown words, no out-of-vocabulary problems, no language-specific rules. The tokenizer is trivially simple — just iterate through characters.
The problem is sequence length. The sentence “The transformer architecture revolutionized natural language processing” is 71 characters but only 7 words. At the character level, the attention mechanism must process a 71 × 71 ≈ 5,000-element attention matrix. At the word level, that same matrix is 7 × 7 = 49 elements — a roughly 100x reduction for this sentence.
Recall from Part 1 that attention cost scales as O(n²), where n is sequence length. Character-level tokenization inflates n by roughly 5-6x compared to word-level tokenization for English text. That means attention cost increases by 25-36x. For a 4,096-token context window at the word level, you would need a 20,000-25,000-token context window at the character level to represent the same amount of text — and the attention cost would be 25-36x higher.
There is a second, subtler problem. Characters carry very little semantic information individually. The letter “t” means nothing by itself — the model must learn to compose characters into meaningful units across many positions. This means the model needs more layers and more capacity to learn the same linguistic patterns that a word-level model picks up easily. Empirically, character-level language models require significantly more compute to reach the same perplexity as word-level models on English text.
Character-level tokenization does not just make sequences longer — it makes them quadratically more expensive. A 5x increase in sequence length produces a 25x increase in attention FLOPs. This is the fundamental reason character-level models never scaled to the sizes we see today.
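The quadratic relationship is easy to verify numerically. A minimal sketch (the 5x figure is the English-average inflation cited above):

```python
def attention_cost_ratio(char_seq_len, word_seq_len):
    """Attention FLOPs scale with the square of sequence length,
    so the relative cost is the squared ratio of the two lengths."""
    return (char_seq_len / word_seq_len) ** 2

# A 5x longer character-level sequence costs 25x more attention FLOPs.
ratio = attention_cost_ratio(5_000, 1_000)
```

At the 6x end of the range, the same function gives a 36x cost increase, matching the 25-36x span quoted above.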
Word-Level Tokenization
The opposite extreme: split on whitespace and punctuation, and assign each unique word a token ID. English text compresses well — “The transformer architecture” becomes just 3 tokens. Sequences are short, attention matrices are small, and each token carries rich semantic information.
The problem is the vocabulary. The English language has over 170,000 words in current use, but that number explodes once you include proper nouns, technical terms, compound words, misspellings, morphological variants, and other languages. “running”, “runs”, “ran”, “runner”, “runners” are all separate vocabulary entries despite sharing the root “run”. Code identifiers like getMaxBatchSize or num_attention_heads are unique words that appear rarely.
A word-level vocabulary of 170,000 entries creates two problems. First, the embedding table has 170,000 rows, each of dimension d_model. For d_model = 4096, that is 170,000 × 4,096 × 2 bytes ≈ 1.4 GB in FP16 — a substantial fraction of a 7B-parameter model’s total memory. Second, and more critically, any word not in the vocabulary is mapped to a single [UNK] (unknown) token, destroying all information about what that word actually was. A medical model encountering “hydroxychloroquine” for the first time would see it as [UNK] — identical to every other unknown word.
The open vocabulary problem is unsolvable at the word level. No finite word vocabulary can cover all possible inputs. Languages with productive morphology (German, Finnish, Turkish) can generate compound words that have never appeared in any training corpus. Code, URLs, and mathematical expressions contain arbitrary symbol sequences. Even restricting to English, new words are coined constantly.
The Subword Sweet Spot
Subword tokenization resolves the tension between these two extremes. The core idea: maintain a vocabulary of common subword units — some are full words, some are word fragments, some are individual characters. Common words like “the” and “is” get their own tokens. Uncommon words are decomposed into known subword pieces.
The word “understanding” might be tokenized as a single token if it is common enough, or split into “under” + “standing”, or further into “under” + “stand” + “ing”. The word “hydroxychloroquine” might become “hydro” + “xy” + “chlor” + “o” + “quine”. Every possible string can be represented because individual bytes or characters serve as a fallback — the vocabulary is effectively closed (no [UNK] needed) while remaining finite and manageable.
The tradeoff is explicit and tunable: vocabulary size controls the balance between sequence length and embedding table size. A 32K vocabulary produces slightly longer sequences than a 128K vocabulary but requires a 4x smaller embedding table. This is a knob you can turn, and different models turn it to different positions based on their priorities.
Tokenization Strategy Comparison
| Strategy | Vocab Size | Seq Length (English) | OOV Handling | Multilingual |
|---|---|---|---|---|
| Character-level | ~100-300 | 5-6x longer | No OOV (complete) | Good coverage |
| Word-level | 170K+ | 1x (baseline) | UNK token (lossy) | Requires per-language vocab |
| Subword (BPE, 32K) | 32,000 | ~1.3x | Decomposition (lossless) | Moderate |
| Subword (BPE, 128K) | 128,000 | ~1.0x | Decomposition (lossless) | Good |
Part 2: Byte Pair Encoding from Scratch
Byte Pair Encoding (BPE) is the dominant subword tokenization algorithm. It was originally a data compression technique invented by Philip Gage in 1994, adapted for NLP by Sennrich et al. in 2016. The mental model is this: BPE finds the most compressible representation of the training corpus within a given vocabulary budget. It greedily merges the most frequent adjacent token pairs until the vocabulary reaches the target size.
The Algorithm
The BPE training algorithm is beautifully simple:
- Start with a vocabulary containing all individual characters (or bytes) that appear in the training corpus.
- Count the frequency of every adjacent pair of tokens in the corpus.
- Merge the most frequent pair into a single new token. Add it to the vocabulary.
- Replace all occurrences of that pair in the corpus with the new token.
- Repeat from step 2 until the vocabulary reaches the desired size.
That is the entire algorithm. Let us walk through it on a concrete example.
Worked Example
Consider a small training corpus consisting of repeated occurrences of these words (with frequencies):
"low" : 5 times
"lower" : 2 times
"newest" : 6 times
"widest" : 3 times
Step 0: Initialize character vocabulary
We split every word into characters and add a special end-of-word symbol _ to mark word boundaries. The initial representation is:
"l o w _" : 5
"l o w e r _" : 2
"n e w e s t _" : 6
"w i d e s t _" : 3
Initial vocabulary: {l, o, w, e, r, n, s, t, i, d, _} (11 tokens)
Iteration 1: Count all adjacent pairs
Pair frequencies (sorted by count):
(e, s) : 6 + 3 = 9 <- from "newest" and "widest"
(s, t) : 6 + 3 = 9 <- from "newest" and "widest"
(t, _) : 6 + 3 = 9 <- from "newest" and "widest"
(w, e) : 2 + 6 = 8 <- from "lower" and "newest"
(l, o) : 5 + 2 = 7 <- from "low" and "lower"
(o, w) : 5 + 2 = 7 <- from "low" and "lower"
(n, e) : 6 <- from "newest"
(e, w) : 6 <- from "newest"
(w, _) : 5 <- from "low"
(w, i) : 3 <- from "widest"
(i, d) : 3 <- from "widest"
(d, e) : 3 <- from "widest"
(e, r) : 2 <- from "lower"
(r, _) : 2 <- from "lower"
Three pairs tie at frequency 9: (e, s), (s, t), and (t, _). We pick (e, s) (tie-breaking by order).
Merge: e + s becomes the new token es.
"l o w _" : 5
"l o w e r _" : 2
"n e w es t _" : 6
"w i d es t _" : 3
Vocabulary: {l, o, w, e, r, n, s, t, i, d, _, es} (12 tokens)
Iteration 2: Recount pairs
The most frequent pair is now (es, t) with frequency 9 (from “newest” and “widest”).
Merge: es + t becomes est.
"l o w _" : 5
"l o w e r _" : 2
"n e w est _" : 6
"w i d est _" : 3
Vocabulary: 13 tokens (added est)
Iteration 3: The most frequent pair is (est, _) with frequency 9.
Merge: est + _ becomes est_.
"l o w _" : 5
"l o w e r _" : 2
"n e w est_" : 6
"w i d est_" : 3
Vocabulary: 14 tokens (added est_)
Iteration 4: Now (l, o) and (o, w) both have frequency 7. Pick (l, o).
Merge: l + o becomes lo.
"lo w _" : 5
"lo w e r _" : 2
"n e w est_" : 6
"w i d est_" : 3
Vocabulary: 15 tokens (added lo)
Iteration 5: (lo, w) has frequency 7.
Merge: lo + w becomes low.
"low _" : 5
"low e r _" : 2
"n e w est_" : 6
"w i d est_" : 3
Vocabulary: 16 tokens (added low)
Iteration 6: (low, _) has frequency 5, while (n, e), (e, w), and (w, est_) tie at frequency 6. Among the three tied pairs, we pick (e, w).
Merge: e + w becomes ew.
"low _" : 5
"low e r _" : 2
"n ew est_" : 6
"w i d est_" : 3
Vocabulary: 17 tokens (added ew)
After just 6 iterations, observe what has happened. The suffix “-est” has been captured as a single token est_ — the algorithm discovered that this is a common English morphological pattern. The prefix “low” has been captured as a single token. The algorithm did not know anything about English morphology; it simply followed the statistics of character co-occurrence.
If we continued, the next merges would likely produce new from (n, ew), then newest_ from (new, est_). The algorithm would eventually discover that “low_” and “lower” share a common prefix, and might create tokens for common suffixes like “-er”, “-ing”, “-tion”.
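The training loop is short enough to sketch in full. Below is a minimal, unoptimized implementation with ties broken by first occurrence in corpus order, a convention that reproduces iterations 1-5 above (iteration 6's three-way tie may resolve differently under other tie-break rules):

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent token pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = pair[0] + pair[1]
    new_corpus = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_corpus[tuple(out)] = freq
    return new_corpus

def train_bpe(corpus, num_merges):
    """Greedy BPE training: repeatedly merge the most frequent pair.
    max() over a dict breaks ties by insertion (first-seen) order."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges

# The corpus from the worked example, words pre-split into characters + "_".
corpus = {
    tuple("low") + ("_",): 5,
    tuple("lower") + ("_",): 2,
    tuple("newest") + ("_",): 6,
    tuple("widest") + ("_",): 3,
}
merges = train_bpe(corpus, 5)
# First five merges: ('e','s'), ('es','t'), ('est','_'), ('l','o'), ('lo','w')
```

Production implementations differ mainly in data structures (incremental pair counts instead of full recounts), not in the algorithm itself.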
The Compression Mental Model
BPE is fundamentally a compression algorithm. Each merge reduces the total number of tokens in the corpus by replacing two tokens with one. The merge that reduces the corpus length the most is the one that replaces the most frequent pair — which is exactly what BPE selects.
After k merges, the vocabulary has grown by k entries and the corpus has been compressed by approximately f_1 + f_2 + … + f_k tokens, where f_i is the frequency of the pair merged at step i. The algorithm is greedy: it picks the locally optimal merge at each step without considering future consequences. This greedy strategy works remarkably well in practice because high-frequency pairs tend to remain useful as the vocabulary grows.
The vocabulary budget is the stopping condition. If you want a 32K vocabulary and your initial character set has 256 entries (byte-level BPE), you perform 32,000 − 256 = 31,744 merge operations. Each merge adds one token to the vocabulary and makes the corpus slightly shorter.
The key mental model: a BPE vocabulary of size V is the set of V substrings that gives the best compression of the training corpus under a greedy merging strategy. Frequent substrings get their own tokens (short encoding), rare substrings are decomposed into characters (long encoding). This is exactly analogous to Huffman coding — common patterns get short codes.
BPE Encoding (Inference Time)
Training produces a merge table — an ordered list of merge rules. To tokenize a new string at inference time:
- Split the string into characters (or bytes).
- Apply merge rules in priority order: find the highest-priority merge that can be applied, apply it, and repeat.
- Continue until no more merges are applicable.
def bpe_encode(text, merge_rules):
"""
Apply BPE encoding to a string given trained merge rules.
merge_rules: list of (pair, merged_token) in priority order
"""
# Start with character-level tokens
tokens = list(text)
for pair, merged in merge_rules:
i = 0
new_tokens = []
while i < len(tokens):
# Look for the pair at position i
if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i+1] == pair[1]:
new_tokens.append(merged)
i += 2 # Skip both tokens in the pair
else:
new_tokens.append(tokens[i])
i += 1
tokens = new_tokens
return tokens
This is O(N × M), where N is the string length and M is the number of merge rules. In practice, optimized implementations use hash tables and priority queues to bring this close to linear in the string length.
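Applying learned rules to a word that never appeared in the training corpus shows why subword decomposition is lossless. Here is a compact, self-contained restatement of the same loop, using the five merges from the worked example:

```python
def apply_merges(tokens, rules):
    """Apply ordered BPE merge rules (pair, merged_token) to a token list."""
    for pair, merged in rules:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# The five merges learned in the worked example above.
rules = [(("e", "s"), "es"), (("es", "t"), "est"), (("est", "_"), "est_"),
         (("l", "o"), "lo"), (("lo", "w"), "low")]

# "lowest" never appeared in the training corpus, yet it decomposes
# cleanly into two known subwords: "low" + "est_".
tokens = apply_merges(list("lowest") + ["_"], rules)
```

The model has seen both "low" and "est_" in other contexts, so an unseen word still maps to meaningful units rather than an [UNK] token.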
Part 3: SentencePiece vs tiktoken vs HuggingFace Tokenizers
Three implementations dominate the LLM ecosystem. They all implement BPE (and sometimes alternative algorithms), but differ in their pre-tokenization strategy, implementation language, and design philosophy.
SentencePiece
SentencePiece, developed by Google, is the most language-agnostic of the three. Its defining feature: it operates on raw Unicode strings without any language-specific pre-tokenization. Where other tokenizers split on whitespace first and then apply BPE within each word, SentencePiece treats the entire input — including spaces — as a flat sequence of characters. Spaces are represented as the Unicode character U+2581 (LOWER ONE EIGHTH BLOCK, rendered as ▁), making them explicit tokens that participate in BPE merges.
This design has a crucial advantage: it works identically for languages that do not use whitespace word boundaries (Chinese, Japanese, Thai). A whitespace-based pre-tokenizer would treat an entire Chinese sentence as a single “word”, making BPE within it behave very differently from English. SentencePiece avoids this asymmetry entirely.
SentencePiece also implements the Unigram model as an alternative to BPE. Instead of building a vocabulary bottom-up through merges, the Unigram model starts with a large initial vocabulary and iteratively removes tokens that contribute least to the corpus likelihood. The Unigram model optimizes a probabilistic objective (maximizing the likelihood of the training data under a unigram language model over subword units) and can find more globally optimal vocabularies than BPE’s greedy approach.
Used by: Llama 1, Llama 2, T5, ALBERT, XLNet, Mistral (early versions)
tiktoken
tiktoken is OpenAI’s tokenizer library, purpose-built for the GPT family of models. It implements byte-level BPE with regex-based pre-tokenization. Before BPE merges are applied, tiktoken uses a regex pattern to split the input into coarse chunks:
# Simplified version of GPT-4's pre-tokenization pattern
import regex
pattern = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
This regex splits text into words, numbers (in groups of up to 3 digits), punctuation, and whitespace. BPE is then applied independently within each chunk. The pre-tokenization prevents BPE from merging across word boundaries — the token “the” can never merge with a following space and the first letter of the next word.
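The production GPT-4 pattern requires the third-party `regex` module (for `\p` character classes and possessive quantifiers), but the idea can be sketched with the standard library. The simplified pattern below is an illustrative approximation, not the real rule:

```python
import re

# Stdlib-only stand-in for tiktoken-style pre-tokenization. Alternatives
# are tried in order: letter runs (with optional leading space), digit
# groups of at most 3, punctuation runs, then any remaining whitespace.
PRETOKEN = re.compile(
    r" ?[A-Za-z]+"         # optional leading space + run of letters
    r"|[0-9]{1,3}"         # digits in groups of up to 3
    r"| ?[^A-Za-z0-9\s]+"  # optional leading space + punctuation run
    r"|\s+"                # leftover whitespace
)

def pretokenize(text):
    """Split text into chunks; BPE merges never cross chunk boundaries."""
    return PRETOKEN.findall(text)

chunks = pretokenize("Hello world, take 12345 tokens!")
# The chunks cover the input exactly, so the split is lossless,
# and "12345" is pre-split into "123" + "45" before BPE ever runs.
```

Note how the digit rule alone already shapes arithmetic tokenization: five-digit numbers can never become single chunks under this pattern.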
tiktoken is implemented in Rust with Python bindings, making it extremely fast. Its byte-level approach means the base vocabulary is exactly 256 entries (one per byte value), ensuring that any byte sequence can be encoded — no [UNK] token is ever needed.
Used by: GPT-2, GPT-3, GPT-3.5, GPT-4, GPT-4o
HuggingFace Tokenizers
The HuggingFace tokenizers library is the Swiss Army knife of tokenization. It provides a unified API for BPE, WordPiece (used by BERT), Unigram, and character-level models. The core is written in Rust for speed, with bindings for Python and Node.js.
Its modular architecture separates the pipeline into distinct stages: normalization (Unicode NFKC, lowercasing), pre-tokenization (whitespace splitting, byte-level mapping), the core model (BPE, WordPiece, Unigram), and post-processing (adding special tokens, building attention masks). Each stage is independently configurable.
The library’s primary strength is flexibility and ecosystem integration. It can load tokenizers from any HuggingFace model repository, train new tokenizers on custom data, and convert between formats. For production inference, it provides the same Rust-powered speed as tiktoken.
Used by: Most open-source models via the HuggingFace ecosystem (Llama 3, Qwen, Falcon, MPT, etc.)
Tokenizer Implementation Comparison
| Feature | SentencePiece | tiktoken | HuggingFace Tokenizers |
|---|---|---|---|
| Language | C++ | Rust + Python | Rust + Python |
| Pre-tokenization | None (whitespace as token) | Regex-based | Configurable |
| Algorithms | BPE, Unigram | Byte-level BPE | BPE, WordPiece, Unigram |
| Base vocabulary | Unicode codepoints | 256 bytes | Configurable |
| CJK handling | Native (no whitespace bias) | Via regex rules | Configurable |
| Speed (English, 1MB) | ~120 ms | ~45 ms | ~55 ms |
| Ecosystem | Google models | OpenAI models | Most open-source models |
When to Use Each
Use SentencePiece when training a model from scratch on heavily multilingual data, especially if your corpus includes significant CJK text. Its language-agnostic design avoids English-centric biases. Also use it when you want to experiment with the Unigram algorithm.
Use tiktoken when working with OpenAI models or when you need the fastest possible encoding speed. Its byte-level design is clean and predictable. It is also the right choice if you are building a tokenizer for a new model and want the simplest, most robust design.
Use HuggingFace Tokenizers when working with the broader open-source ecosystem, when you need to load pre-trained tokenizers from model repositories, or when you need maximum configurability over the tokenization pipeline.
Part 4: Vocabulary Size Tradeoffs
The vocabulary size is one of the most consequential hyperparameters in LLM design, yet it receives far less attention than model depth, width, or learning rate. It controls a three-way tradeoff between sequence length, embedding table size, and multilingual coverage.
The Core Tradeoff
A larger vocabulary means more subword units, which means common sequences get their own dedicated tokens instead of being split into pieces. This produces shorter token sequences for the same text — better compression. But each vocabulary entry requires a row in the embedding table (d_model parameters) and a row in the output projection / language model head (another d_model parameters). The embedding table for a 128K vocabulary at d_model = 4096 is 128,000 × 4,096 ≈ 524 million parameters — about 7.5% of a 7B-parameter model.
Compression Ratio Across Languages
The most dramatic impact of vocabulary size is on multilingual performance. English text compresses well even with small vocabularies because English words are short and the Latin alphabet is compact. CJK languages (Chinese, Japanese, Korean) and languages with complex scripts (Arabic, Hindi, Thai) require many more bytes per character and have much larger character sets. A vocabulary trained primarily on English data will assign single tokens to common English words but split CJK characters into multiple byte-level tokens.
Tokens per 1,000 Characters by Vocabulary Size
| Language | 32K vocab (GPT-2) | 50K vocab (Llama 2) | 100K vocab (GPT-4) | 128K vocab (Llama 3) |
|---|---|---|---|---|
| English | ~280 tokens | ~265 tokens | ~240 tokens | ~230 tokens |
| Chinese | ~700 tokens | ~550 tokens | ~350 tokens | ~320 tokens |
| Japanese | ~650 tokens | ~520 tokens | ~330 tokens | ~300 tokens |
| Python code | ~350 tokens | ~320 tokens | ~270 tokens | ~250 tokens |
| Arabic | ~580 tokens | ~470 tokens | ~310 tokens | ~280 tokens |
The numbers are striking. With a 32K vocabulary, Chinese text requires 2.5x more tokens per character than English. This means a Chinese user processing a document of equivalent semantic content pays 2.5x more in attention cost and 2.5x more in API tokens-per-dollar. Expanding the vocabulary to 128K nearly eliminates this gap.
The Memory Cost
The flip side: bigger vocabularies mean bigger embedding tables. The embedding table and the language model head (which is often a tied copy of the embedding table, or an untied matrix of the same shape) together account for:
params(embedding + head) = k × V × d_model, where k = 2 if the LM head has separate weights (common in large models), and k = 1 if the weights are tied.
Embedding Table Size by Vocabulary Size (d_model = 4096, FP16)
| Vocab Size | Embedding Params | Memory (tied) | Memory (untied) | Share of 7B Model |
|---|---|---|---|---|
| 32,000 | 131M | 250 MB | 500 MB | 1.9% / 3.7% |
| 50,000 | 205M | 391 MB | 781 MB | 2.9% / 5.6% |
| 100,000 | 410M | 781 MB | 1,562 MB | 5.9% / 11.2% |
| 128,000 | 524M | 1,000 MB | 2,000 MB | 7.5% / 14.3% |
| 256,000 | 1,049M | 2,000 MB | 4,000 MB | 15.0% / 28.6% |
At 128K vocabulary, the embedding table is 1 GB (tied) or 2 GB (untied) in FP16. This is not catastrophic for a 7B model (14 GB total weights), but it is substantial — and it scales linearly with vocabulary size. A 256K vocabulary would consume 4 GB for the embedding parameters alone in the untied case.
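The table's arithmetic is straightforward to reproduce. A quick sketch, using 2 bytes per FP16 parameter:

```python
def embedding_footprint_bytes(vocab_size, d_model=4096, tied=True):
    """FP16 memory of the token embedding table; doubled when the
    LM head is a separate (untied) matrix of the same shape."""
    params = vocab_size * d_model
    copies = 1 if tied else 2
    return params * copies * 2  # 2 bytes per FP16 parameter

mem_tied = embedding_footprint_bytes(128_000)                 # ~1.05e9 bytes
mem_untied = embedding_footprint_bytes(128_000, tied=False)   # ~2.1e9 bytes
```

The linear scaling is visible immediately: doubling the vocabulary doubles this footprint, with no economies of scale.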
Compute Cost at the Boundaries
The embedding lookup itself is cheap — just a table lookup, linear in sequence length. But the language model head at the output layer is a full matrix multiplication: the hidden state of dimension d_model is multiplied by the d_model × V vocabulary projection matrix to produce logits for all V vocabulary entries. This costs roughly 2 × d_model × V FLOPs per forward pass.
For a single decode step (one token, batch size 1) with d_model = 4096:
- At V = 32,000: 2 × 4,096 × 32,000 ≈ 262 MFLOPs
- At V = 128,000: 2 × 4,096 × 128,000 ≈ 1,049 MFLOPs
The LM head at 128K vocabulary costs 4x more than at 32K. During decode, when every microsecond of latency matters, this is a measurable difference — though it is still much smaller than the total cost of all transformer layers.
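The decode-step numbers above come from a one-line formula, counting each multiply-accumulate as 2 FLOPs:

```python
def lm_head_mflops(d_model, vocab_size):
    """FLOPs to project one hidden state to vocabulary logits,
    counting a multiply-accumulate as 2 FLOPs."""
    return 2 * d_model * vocab_size / 1e6

small = lm_head_mflops(4096, 32_000)    # ~262 MFLOPs
large = lm_head_mflops(4096, 128_000)   # ~1,049 MFLOPs, exactly 4x small
```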
LM Head FLOPs per Decode Step by Vocabulary Size (d = 4096, MFLOPs)
The Optimal Vocabulary Size
There is no single optimal vocabulary size — the right choice depends on the model’s intended use. Here is the current consensus among practitioners:
- 32K is sufficient for English-dominant models with short context. It keeps embedding tables small and works well for English and code.
- 50-65K is a reasonable middle ground for bilingual (English + one other language) models.
- 100-128K is the current standard for multilingual models and models expected to handle diverse inputs (code, math, multiple languages). The compression gains for non-English text more than compensate for the larger embedding table.
- Beyond 128K shows diminishing returns. The marginal compression gain per additional vocabulary entry decreases as you add rarer and rarer subwords, while the embedding table cost grows linearly.
If your model will serve multilingual users or process code and structured data: use 100K+ vocabulary. The 15-30% reduction in token count for non-English text directly translates to 15-30% lower inference cost (fewer KV cache entries, fewer decode steps, fewer API tokens billed). The embedding table overhead is modest compared to the savings.
Part 5: Special Tokens
Every tokenizer vocabulary includes a set of special tokens — tokens that do not correspond to any text in the input but serve structural roles in the model’s input/output protocol. Understanding these tokens is essential for correct inference and fine-tuning.
The Core Special Tokens
BOS (Beginning of Sequence): Signals the start of a new input. The model’s first token is always BOS, which provides a consistent “starting state” for the first attention layer. Without BOS, the first token would have no context to attend to, and its representation would depend entirely on its position embedding.
EOS (End of Sequence): Signals that the model should stop generating. During autoregressive inference, the generation loop runs until the model outputs the EOS token (or a maximum length is reached). If the tokenizer lacks a proper EOS token, or if the model was not trained with one, the model may generate indefinitely.
PAD (Padding): Used to pad sequences to equal length within a batch. Padded positions are masked out in attention (set to before softmax) so they do not influence computation. PAD tokens are essential for batched inference and training but have no semantic meaning.
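The masking described above can be demonstrated with plain floats: padded positions are set to negative infinity before softmax, so they receive exactly zero weight. A minimal sketch:

```python
import math

scores = [2.0, 1.0, 0.5, 0.0]        # raw attention scores for 4 positions
is_pad = [False, False, True, True]  # last two positions are padding

masked = [-math.inf if pad else s for s, pad in zip(scores, is_pad)]
exps = [math.exp(s) for s in masked]  # exp(-inf) == 0.0
weights = [e / sum(exps) for e in exps]
# Padded positions contribute nothing; the real positions renormalize
# so that the weights still sum to 1.
```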
UNK (Unknown): A fallback token for inputs that cannot be encoded. Modern byte-level tokenizers (tiktoken, Llama 3’s tokenizer) do not need UNK because every byte is in the vocabulary. Older tokenizers that operate at the Unicode codepoint level may still produce UNK for very rare characters.
Chat Template Tokens
Modern instruction-following models use additional special tokens to delineate the roles within a conversation:
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant.
<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is BPE?
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
These tokens serve a critical function: they let the model distinguish between system instructions, user input, and its own output. During instruction tuning, the model learns that content after <|start_header_id|>system<|end_header_id|> should be treated as behavioral constraints, and content after the assistant header is where it should generate its response.
The exact token format varies by model family. Llama 3 uses the format shown above. ChatGPT uses <|im_start|> and <|im_end|>. Mistral uses [INST] and [/INST]. Getting these tokens wrong — for example, using ChatGPT’s template format with a Llama model — will produce garbage output because the model has never seen those token sequences during training.
Using the wrong chat template is one of the most common deployment errors. A model will not crash or throw an error — it will simply produce degraded output because the special token sequences do not match what it learned during instruction tuning. Always use the template specified in the model’s tokenizer configuration.
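For illustration only, here is what assembling the Llama 3 format shown above looks like as plain string manipulation. In practice you should never hand-roll this: use the `apply_chat_template` method bundled with the model's tokenizer, which reads the authoritative template from the tokenizer configuration.

```python
def format_llama3_chat(messages):
    """Hand-rolled sketch of the Llama 3 chat format, for illustration.
    Real code should call tokenizer.apply_chat_template instead."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    # The trailing assistant header cues the model to generate its reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = format_llama3_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is BPE?"},
])
```

A single wrong delimiter in a hand-rolled version produces token sequences the model never saw in training, which is exactly the silent failure described above.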
EOS and Inference Cost
The EOS token has a direct performance implication: it determines when generation stops. A model that does not reliably produce EOS will generate tokens until hitting the maximum length limit, wasting compute and producing trailing garbage. Conversely, a model that produces EOS too early will truncate its responses.
During beam search or sampling, EOS detection must be checked at every decode step. Some serving frameworks implement “stop sequences” — multi-token patterns that trigger early stopping — which require the tokenizer to check whether the most recent tokens match a stop pattern. This is a small but measurable overhead at each decode step.
Part 6: Tokenization Artifacts and Failure Modes
Tokenization is not a neutral preprocessing step. It introduces artifacts — systematic biases and failure modes that propagate through the entire model. Some of these are well-known; others are subtle and only manifest in specific contexts.
Glitch Tokens: The SolidGoldMagikarp Problem
In 2023, researchers discovered that certain tokens in GPT-2 and GPT-3’s vocabulary produced bizarre behavior when used as prompts. The most famous example was the token “ SolidGoldMagikarp” (note the leading space) — a Reddit username that appeared frequently enough in the training data to earn its own token during BPE training, but then appeared in very different contexts (or not at all) during the model’s language model training.
The mechanism is straightforward. BPE vocabulary training and language model training use different datasets (or the same dataset is processed differently). A token can appear frequently enough in the BPE training corpus to be assigned its own vocabulary entry, but then appear rarely or in unusual contexts during model training. The model’s embedding for this token is undertrained — it occupies a random region of embedding space and produces unpredictable outputs when encountered.
This failure mode is specific to the gap between tokenizer training and model training. If both use identical data and the token has sufficient frequency in both stages, the problem does not arise.
Token Boundary Effects on Arithmetic
GPT models famously struggle with arithmetic. A significant part of this difficulty comes from tokenization. Consider the number “12345”:
- It might be tokenized as "123" + "45" (two tokens)
- Or as "1234" + "5" (two tokens with a different split)
- Or as "12345" (one token, if it is in the vocabulary)
The model does not see the individual digits — it sees arbitrary substrings of digits, with boundaries that depend on what numbers appeared in the BPE training corpus. The number “380” might be a single token while “381” is split into “38” + “1”. This means the model must learn arithmetic not over digits but over inconsistent substrings of digits. The operation “380 + 1 = 381” requires the model to understand that adding 1 to the single token “380” produces the two tokens “38” + “1” — a fundamentally different representation.
How a human sees the problem:
Calculate: 1273 + 5684
The answer is 6957.
Each digit is clearly visible:
    1 2 7 3
  + 5 6 8 4
  ---------
    6 9 5 7
How the model sees it (token boundaries marked with |):
Calculate: |12|73| |+| |56|84|
The answer is |69|57|.
Token boundaries obscure digits:
    |12|73|
  + |56|84|
  ---------
    |69|57|
Character Counting Failures
“How many r’s are in ‘strawberry’?” is a famously difficult question for LLMs. The reason is tokenization: “strawberry” is typically tokenized as "str" + "aw" + "berry" or similar. The model never sees individual characters — it sees subword tokens. To count the occurrences of “r” in “strawberry”, the model would need to decompose its tokens back into characters, which is an operation it was never explicitly trained to perform. The model must implicitly learn the character composition of every token in its vocabulary — a task that is possible in principle but difficult in practice.
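The gap is easy to make concrete. Assuming the three-way split above ("str" + "aw" + "berry" is a plausible hypothetical; actual splits vary by tokenizer), counting characters is trivial for code that sees the string, but the model sees only token IDs and would need to have implicitly memorized each token's spelling:

```python
word = "strawberry"
tokens = ["str", "aw", "berry"]  # hypothetical split; varies by tokenizer

# Trivial with direct access to the characters:
direct_count = word.count("r")

# From tokens, the count is only recoverable if the spelling of every
# token is known — knowledge the model must learn implicitly, since it
# is never shown token spellings during training:
token_count = sum(t.count("r") for t in tokens)
```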
Multi-Byte UTF-8 and Token Boundaries
UTF-8 encodes Unicode characters using 1 to 4 bytes. ASCII characters use 1 byte, common accented characters use 2 bytes, CJK characters use 3 bytes, and emoji use 4 bytes. In byte-level BPE, each byte is a base vocabulary entry. A CJK character is initially represented as 3 separate byte tokens.
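Python's `encode` makes these byte counts concrete — each character below starts out as that many base tokens in byte-level BPE:

```python
# Expected UTF-8 byte lengths: ASCII, accented Latin, CJK, emoji.
samples = {"A": 1, "é": 2, "中": 3, "😀": 4}

for ch, expected in samples.items():
    n_bytes = len(ch.encode("utf-8"))
    # In byte-level BPE, this character begins as n_bytes separate tokens
    # until (and unless) BPE merges fuse them.
    assert n_bytes == expected
```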
If the BPE training corpus contains enough CJK text, these 3-byte sequences will frequently co-occur and merge into single tokens. But if the training corpus is English-heavy, the byte sequences for CJK characters may never merge completely, leaving CJK text represented as 2-3x more tokens than necessary.
This is the byte-level fallback’s hidden cost: it guarantees coverage (no UNK tokens) but the quality of coverage depends entirely on the BPE training corpus distribution. GPT-4’s 100K vocabulary includes many merged CJK byte sequences, but a 32K vocabulary trained on English-heavy data will represent Chinese text with far more tokens per character.
A tokenizer trained on English-heavy data can make Chinese or Arabic text 2-3x more expensive to process than English text of equivalent semantic content. This is not a model limitation — it is a tokenizer limitation. The fix is a larger vocabulary with more diverse training data.
Part 7: Performance — Tokenizer Throughput
Tokenization is a CPU operation. It happens before any GPU computation and after all GPU computation (for decoding tokens back to text). In most inference pipelines, it is fast enough that nobody thinks about it. But at extreme scale, tokenizer throughput can become relevant.
Pure Python vs Rust-backed Implementations
The performance gap between a naive Python tokenizer and a compiled implementation is enormous. A pure Python BPE implementation processes on the order of 10-50 KB/s. The Rust-backed implementations (tiktoken, HuggingFace tokenizers) process 10-100 MB/s, a speedup of roughly three orders of magnitude.
This gap exists because BPE encoding requires intensive string manipulation: scanning for pairs, performing replacements, rescanning. Python’s interpreter overhead and string immutability make this extremely slow. Rust’s zero-cost abstractions and mutable string handling eliminate this overhead almost entirely.
[Chart: Tokenizer Throughput (Single Thread, English Text), in MB/s]
Batched Tokenization
All three major implementations support batched tokenization — processing multiple strings in parallel. HuggingFace tokenizers can parallelize across CPU cores using its built-in parallelism:
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("meta-llama/Llama-3-8B")

# Single string - uses one core
encoded = tokenizer.encode("Hello world")

# Batch of strings - automatically parallelized across cores
batch_encoded = tokenizer.encode_batch([
    "First document...",
    "Second document...",
    "Third document...",
    # ... thousands of documents
])
```
With 8 CPU cores, batched tokenization achieves roughly 6-7x the single-thread throughput (not a perfect 8x due to overhead and memory contention).
Pre-Tokenization as a Parallelism Opportunity
The regex-based pre-tokenization step (splitting text into words/chunks before applying BPE) is embarrassingly parallel. Each chunk can be BPE-encoded independently because BPE merges never cross chunk boundaries. This is why tiktoken’s regex-first design is particularly fast: the regex split produces many small independent chunks that can be processed in parallel, while the BPE encoding of each chunk is a small, cache-friendly operation.
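The chunk-and-parallelize structure can be sketched in a few lines. Here `encode_chunk` is a hypothetical stand-in for per-chunk BPE, and the regex is far simpler than real pre-tokenization patterns; real implementations do this in Rust threads, so Python's `ThreadPoolExecutor` is shown only to illustrate the shape of the pipeline:

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Simplified pre-tokenization: split into whitespace-prefixed chunks.
# Real pre-tokenizers use much richer patterns (contractions, digits, etc.).
PRETOKEN_RE = re.compile(r"\s*\S+")

def encode_chunk(chunk: str) -> list[str]:
    # Stand-in for per-chunk BPE encoding. Because merges never cross
    # chunk boundaries, each chunk can be encoded independently.
    return [chunk]

def encode(text: str) -> list[str]:
    chunks = PRETOKEN_RE.findall(text)
    with ThreadPoolExecutor() as pool:           # GIL-bound in Python;
        results = pool.map(encode_chunk, chunks)  # Rust threads run truly parallel
    return [tok for toks in results for tok in toks]

print(encode("Hello brave new world"))
# ['Hello', ' brave', ' new', ' world']
```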
When Throughput Matters
For the vast majority of applications, tokenizer throughput is irrelevant. At 80 MB/s, tiktoken can tokenize a 1 MB document in 12.5 milliseconds — far less than the time the GPU spends processing the resulting tokens. Even for a 100 MB corpus, tokenization takes about 1.25 seconds single-threaded.
Tokenizer throughput becomes relevant in two scenarios:
Offline data preprocessing. When tokenizing terabytes of training data for LLM pre-training, the total tokenization time can be hours or days. Using a fast Rust-backed tokenizer instead of a Python implementation saves significant wall-clock time. The Llama 3 training corpus of ~15 trillion tokens corresponds to tens of terabytes of raw text; even at 80 MB/s, that is more than a week of single-core time.
Ultra-high-throughput serving. If your inference server handles thousands of requests per second, the CPU time for tokenization adds up. At 10,000 requests/s with an average 500 bytes per request, you need 5 MB/s of sustained tokenization throughput — easily achievable with any compiled implementation, but potentially problematic with pure Python.
If your inference pipeline is slow, the tokenizer is almost certainly not the bottleneck. Profile first. The GPU time for a single forward pass through a 7B-parameter model is 10-100 ms; tokenizing the input takes under 1 ms. The tokenizer is 100-1000x faster than the model.
Part 8: The Connection to Inference Cost
Tokenization directly controls inference cost through two mechanisms: the number of tokens in the input (which determines prompt processing cost) and the number of tokens in the output (which determines generation cost). Understanding this connection is critical for capacity planning and cost optimization.
Fewer Tokens = Faster Inference = Lower Cost
The cost of processing a prompt is dominated by the attention computation, which scales quadratically in the number of tokens. The cost of generating each output token is dominated by KV cache reads, which scale linearly in the current sequence length. In both cases, fewer tokens means less work.
A tokenizer that produces 20% fewer tokens for the same text reduces:
- Prefill compute by roughly 36% (0.8² = 0.64, so a 36% reduction)
- KV cache memory by 20% (linear in token count)
- Generation cost by 20% (fewer decode steps to produce the same text, fewer KV entries to read per step)
- API cost by 20% (most APIs bill per token)
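The bullet-point arithmetic follows directly from the scaling behavior; a quick sanity check, where r is the token-count ratio:

```python
# Savings from a 20% token-count reduction (ratio r = 0.8).
r = 0.8
prefill_saving = 1 - r ** 2      # attention prefill scales ~quadratically
kv_saving = 1 - r                # KV cache memory is linear in token count
extra_concurrency = 1 / r - 1    # requests fitting in a fixed KV cache budget

print(f"{prefill_saving:.0%} less prefill compute")         # 36%
print(f"{kv_saving:.0%} less KV cache memory")              # 20%
print(f"{extra_concurrency:.0%} more concurrent requests")  # 25%
```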
This is why Llama 3’s move from a 32K to a 128K vocabulary was significant. For Chinese text, the token count dropped by roughly 50% relative to Llama 1’s 32K tokenizer. That is a 75% reduction in prefill attention cost (0.5² = 0.25) and a 50% reduction in generation cost — for the same text, with the same model quality.
[Chart: Token Reduction Impact on Inference Cost (Llama 3 vs Llama 1 Tokenizer), as % of Llama 1 cost]
The Embedding Table Tradeoff
A larger vocabulary is not free. The embedding table and LM head must be loaded from HBM at least once per forward pass. During autoregressive decoding, the LM head computation (a d_model × V matrix-vector multiply, where V is the vocabulary size) happens at every single decode step.
For a 128K vocabulary with d_model = 4096:
- Embedding table: 128K × 4,096 × 2 bytes ≈ 1,000 MB in FP16
- LM head weight read per decode step: ≈ 1,000 MB
- At 2 TB/s HBM bandwidth: ~0.5 ms per decode step just for the LM head
Compare to a 32K vocabulary:
- LM head weight read: 250 MB per step
- At 2 TB/s: 0.125 ms per step
The difference is 0.375 ms per decode step. For a 500-token generation, that is 187.5 ms of additional latency — noticeable but not catastrophic. And this cost is fully offset if the 128K vocabulary produces even a modest reduction in the number of decode steps needed.
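The LM head arithmetic above can be checked in a few lines, assuming d_model = 4096 and 2 TB/s of HBM bandwidth as in the text (`lm_head_ms` is a hypothetical helper, not a library function). The exact figures differ slightly from the rounded 1,000 MB / 250 MB numbers because the code does not round the weight sizes:

```python
# Back-of-envelope LM head read cost per decode step (FP16 = 2 bytes/weight).
def lm_head_ms(vocab_size, d_model=4096, hbm_bytes_per_s=2e12):
    weight_bytes = vocab_size * d_model * 2        # LM head matrix in FP16
    return weight_bytes / hbm_bytes_per_s * 1e3    # seconds -> milliseconds

big = lm_head_ms(128_000)    # ~0.52 ms per step
small = lm_head_ms(32_000)   # ~0.13 ms per step
print(big - small)           # ~0.39 ms extra per decode step
print((big - small) * 500)   # ~197 ms over a 500-token generation
```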
KV Cache: Where the Real Savings Are
The largest inference cost saving from better tokenization comes through the KV cache. Every token in the sequence occupies KV cache memory in every layer of the model:
For a model with 32 layers and d_model = 4096 in FP16, full multi-head attention stores K and V at every layer: 2 × 32 × 4,096 × 2 bytes = 512 KB per token per request. (Llama 3 8B’s grouped-query attention, with 8 KV heads instead of 32, cuts this to 128 KB; the proportional savings are the same either way.)
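The per-token figure, and its grouped-query variant, can be computed directly (`kv_kb_per_token` is an illustrative helper):

```python
# KV cache per token: 2 (K and V) x layers x KV heads x head_dim x 2 bytes (FP16).
def kv_kb_per_token(n_layers, n_kv_heads, head_dim):
    return 2 * n_layers * n_kv_heads * head_dim * 2 / 1024

print(kv_kb_per_token(32, 32, 128))  # full MHA at d_model = 4096: 512.0 KB
print(kv_kb_per_token(32, 8, 128))   # GQA with 8 KV heads: 128.0 KB
```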
A 20% reduction in token count saves 20% KV cache memory. In a serving system with limited GPU memory, this means:
- 25% more concurrent requests (more requests fit in the same KV cache budget)
- 25% higher throughput
- 20% lower per-request latency (less KV cache to read during each decode step)
At scale, this is enormous. A serving cluster handling 10,000 concurrent requests with an average context length of 4,000 tokens consumes 10,000 × 4,000 × 512 KB ≈ 19.5 TB of KV cache memory. A 20% reduction saves roughly 3.9 TB — enough to serve about 2,500 additional concurrent requests, or to reduce the cluster size by 20%.
Tokenizer Impact on Serving Economics
| Metric | 32K Vocab | 128K Vocab | Savings |
|---|---|---|---|
| Tokens per 1K chars (Chinese) | ~700 | ~320 | 54% fewer tokens |
| KV cache per request (4K context) | 2.0 GB | 0.9 GB | 55% less memory |
| Concurrent requests (80GB GPU) | 40 | 88 | 2.2x more requests |
| Embedding table memory | 250 MB | 1,000 MB | +750 MB overhead |
| LM head latency per step | 0.125 ms | 0.5 ms | +0.375 ms per step |
| Net throughput gain (Chinese) | -- | -- | +80% to +120% |
The Bottom Line
Tokenization is not a preprocessing detail — it is a core economic parameter of LLM deployment. The tokenizer determines how many tokens your users consume, which directly determines your compute cost, memory footprint, and serving capacity. When evaluating models for production deployment, the tokenizer’s compression ratio on your specific data distribution should be part of the evaluation criteria alongside perplexity, accuracy, and latency.
Conclusion
Tokenization sits at the boundary between human-readable text and the numerical representations that transformers operate on. It is the first transformation in the pipeline and it shapes everything downstream.
The key insights from this deep dive:
Subword tokenization is a compression problem. BPE finds the most compressible representation of the training corpus within a vocabulary budget. Common sequences get short encodings (single tokens); rare sequences get long encodings (multiple tokens). The vocabulary budget controls the tradeoff between sequence length and embedding table size.
Vocabulary size is a first-order design decision. The difference between 32K and 128K vocabulary is not academic — it determines whether Chinese text costs 2x or 1x what English text costs to process. For multilingual or code-heavy workloads, a large vocabulary pays for itself many times over through reduced inference cost.
Tokenization has failure modes. Glitch tokens, arithmetic difficulties, character-counting failures, and the multilingual tax are all direct consequences of how the tokenizer splits text. These are not model failures — they are tokenizer failures that the model cannot overcome.
The tokenizer-inference cost connection is quantitative and significant. A 20% reduction in token count translates to 36% less prefill compute, 20% less KV cache memory, and 25% more concurrent requests. At serving scale, this is worth millions of dollars per year.
In Part 3 of this series, we will follow the tokens as they pass through the embedding layer and positional encoding — the mechanism that gives the transformer its sense of position within a sequence. We will see how token IDs become dense vectors, why positional information cannot be baked into the embeddings alone, and how RoPE and ALiBi changed the game for long-context models.
Part 1: The Transformer Attention Mechanism — How attention works, its cost, prefill vs decode profiles. Part 2: Tokenization and BPE (this post) — How text becomes tokens, vocabulary tradeoffs, the cost connection. Part 3: Embeddings and Positional Encoding (coming next) — From token IDs to dense vectors, RoPE, ALiBi.