In Part 1 of this series, we dissected the attention mechanism — how queries match against keys, how softmax produces a probability distribution over positions, how the cost in sequence length dominates everything at long contexts. But we skipped a question that logically comes first: how does raw text become the sequence of token embeddings that attention operates on?
The answer is tokenization, and it is far less innocent than it looks. The tokenizer is the very first thing that touches your input and the very last thing that produces your output. Every decision it makes — how it splits words, how large its vocabulary is, whether it has seen a particular script before — silently propagates through the entire model. Tokenization determines sequence length, which determines attention cost. Tokenization determines vocabulary size, which determines embedding table memory. Tokenization determines how the model “sees” arithmetic, code, multilingual text, and rare words.
This post builds tokenization from first principles. We will start with why the obvious approaches (characters and words) fail, derive Byte Pair Encoding from scratch with a worked example, compare the three dominant tokenizer implementations, analyze the vocabulary size tradeoff with real numbers, catalog the failure modes that trip up practitioners, and connect tokenization directly to inference cost.
Part 1: Why Not Characters? Why Not Words?
Before subword tokenization existed, there were two obvious strategies: tokenize at the character level or tokenize at the word level. Both fail, but understanding exactly how they fail reveals why subwords are the right answer.
Character-Level Tokenization
The simplest possible tokenizer maps each character to a token. The English alphabet gives you roughly 26 lowercase + 26 uppercase + 10 digits + punctuation — about 100 tokens in total. For Unicode, you might need a few thousand entries to cover the common scripts, but the vocabulary stays small and finite.
The appeal is obvious: every possible string can be represented. There are no unknown words, no out-of-vocabulary problems, no language-specific rules. The tokenizer is trivially simple — just iterate through characters.
The problem is sequence length. The sentence “The transformer architecture revolutionized natural language processing” is 71 characters but only 7 words. At the character level, the attention mechanism must process a 71 × 71 ≈ 5,000-element attention matrix. At the word level, that same matrix is 7 × 7 = 49 elements — a roughly 100x reduction for this sentence.
Recall from Part 1 that attention cost scales as O(n²), where n is sequence length. Character-level tokenization inflates n by roughly 5-6x compared to word-level tokenization for English text. That means attention cost increases by 25-36x. For a 4,096-token context window at the word level, you would need a 20,000-25,000-token context window at the character level to represent the same amount of text — and the attention cost would be 25-36x higher.
There is a second, subtler problem. Characters carry very little semantic information individually. The letter “t” means nothing by itself — the model must learn to compose characters into meaningful units across many positions. This means the model needs more layers and more capacity to learn the same linguistic patterns that a word-level model picks up easily. Empirically, character-level language models require significantly more compute to reach the same perplexity as word-level models on English text.
Character-level tokenization does not just make sequences longer — it makes them quadratically more expensive. A 5x increase in sequence length produces a 25x increase in attention FLOPs. This is the fundamental reason character-level models never scaled to the sizes we see today.
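The quadratic relationship is easy to verify numerically. A minimal sketch (the 5x figure is the English-average inflation cited above):

```python
def attention_cost_ratio(char_seq_len, word_seq_len):
    """Attention FLOPs scale with the square of sequence length,
    so the relative cost is the squared ratio of the two lengths."""
    return (char_seq_len / word_seq_len) ** 2

# A 5x longer character-level sequence costs 25x more attention FLOPs.
ratio = attention_cost_ratio(5_000, 1_000)
```

At the 6x end of the range, the same function gives a 36x cost increase, matching the 25-36x span quoted above.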
Word-Level Tokenization
The opposite extreme: split on whitespace and punctuation, and assign each unique word a token ID. English text compresses well — “The transformer architecture” becomes just 3 tokens. Sequences are short, attention matrices are small, and each token carries rich semantic information.
The problem is the vocabulary. The English language has over 170,000 words in current use, but that number explodes once you include proper nouns, technical terms, compound words, misspellings, morphological variants, and other languages. “running”, “runs”, “ran”, “runner”, “runners” are all separate vocabulary entries despite sharing the root “run”. Code identifiers like getMaxBatchSize or num_attention_heads are unique words that appear rarely.
A word-level vocabulary of 170,000 entries creates two problems. First, the embedding table has 170,000 rows, each of dimension d_model. For d_model = 4096, that is 170,000 × 4,096 × 2 bytes ≈ 1.4 GB in FP16 — a substantial fraction of a 7B-parameter model’s total memory. Second, and more critically, any word not in the vocabulary is mapped to a single [UNK] (unknown) token, destroying all information about what that word actually was. A medical model encountering “hydroxychloroquine” for the first time would see it as [UNK] — identical to every other unknown word.
The open vocabulary problem is unsolvable at the word level. No finite word vocabulary can cover all possible inputs. Languages with productive morphology (German, Finnish, Turkish) can generate compound words that have never appeared in any training corpus. Code, URLs, and mathematical expressions contain arbitrary symbol sequences. Even restricting to English, new words are coined constantly.
The Subword Sweet Spot
Subword tokenization resolves the tension between these two extremes. The core idea: maintain a vocabulary of common subword units — some are full words, some are word fragments, some are individual characters. Common words like “the” and “is” get their own tokens. Uncommon words are decomposed into known subword pieces.
The word “understanding” might be tokenized as a single token if it is common enough, or split into “under” + “standing”, or further into “under” + “stand” + “ing”. The word “hydroxychloroquine” might become “hydro” + “xy” + “chlor” + “o” + “quine”. Every possible string can be represented because individual bytes or characters serve as a fallback — the vocabulary is effectively closed (no [UNK] needed) while remaining finite and manageable.
The tradeoff is explicit and tunable: vocabulary size controls the balance between sequence length and embedding table size. A 32K vocabulary produces slightly longer sequences than a 128K vocabulary but requires a 4x smaller embedding table. This is a knob you can turn, and different models turn it to different positions based on their priorities.
Tokenization Strategy Comparison
| Strategy | Vocab Size | Seq Length (English) | OOV Handling | Multilingual |
|---|---|---|---|---|
| Character-level | ~100-300 | 5-6x longer | No OOV (complete) | Good coverage |
| Word-level | 170K+ | 1x (baseline) | UNK token (lossy) | Requires per-language vocab |
| Subword (BPE, 32K) | 32,000 | ~1.3x | Decomposition (lossless) | Moderate |
| Subword (BPE, 128K) | 128,000 | ~1.0x | Decomposition (lossless) | Good |
Part 2: Byte Pair Encoding from Scratch
Byte Pair Encoding (BPE) is the dominant subword tokenization algorithm. It was originally a data compression technique invented by Philip Gage in 1994, adapted for NLP by Sennrich et al. in 2016. The mental model is this: BPE finds the most compressible representation of the training corpus within a given vocabulary budget. It greedily merges the most frequent adjacent token pairs until the vocabulary reaches the target size.
The Algorithm
The BPE training algorithm is beautifully simple:
- Start with a vocabulary containing all individual characters (or bytes) that appear in the training corpus.
- Count the frequency of every adjacent pair of tokens in the corpus.
- Merge the most frequent pair into a single new token. Add it to the vocabulary.
- Replace all occurrences of that pair in the corpus with the new token.
- Repeat from step 2 until the vocabulary reaches the desired size.
That is the entire algorithm. Let us walk through it on a concrete example.
Worked Example
Consider a small training corpus consisting of repeated occurrences of these words (with frequencies):
"low" : 5 times
"lower" : 2 times
"newest" : 6 times
"widest" : 3 times
Step 0: Initialize character vocabulary
We split every word into characters and add a special end-of-word symbol _ to mark word boundaries. The initial representation is:
"l o w _" : 5
"l o w e r _" : 2
"n e w e s t _" : 6
"w i d e s t _" : 3
Initial vocabulary: {l, o, w, e, r, n, s, t, i, d, _} (11 tokens)
Iteration 1: Count all adjacent pairs
Pair frequencies (sorted by count):
(e, s) : 6 + 3 = 9 <- from "newest" and "widest"
(s, t) : 6 + 3 = 9 <- from "newest" and "widest"
(t, _) : 6 + 3 = 9 <- from "newest" and "widest"
(w, e) : 2 + 6 = 8 <- from "lower" and "newest"
(l, o) : 5 + 2 = 7 <- from "low" and "lower"
(o, w) : 5 + 2 = 7 <- from "low" and "lower"
(n, e) : 6 <- from "newest"
(e, w) : 6 <- from "newest"
(w, _) : 5 <- from "low"
(w, i) : 3 <- from "widest"
(i, d) : 3 <- from "widest"
(d, e) : 3 <- from "widest"
(e, r) : 2 <- from "lower"
(r, _) : 2 <- from "lower"
Three pairs tie at frequency 9: (e, s), (s, t), and (t, _). We pick (e, s) (tie-breaking by order).
Merge: e + s becomes the new token es.
"l o w _" : 5
"l o w e r _" : 2
"n e w es t _" : 6
"w i d es t _" : 3
Vocabulary: {l, o, w, e, r, n, s, t, i, d, _, es} (12 tokens)
Iteration 2: Recount pairs
The most frequent pair is now (es, t) with frequency 9 (from “newest” and “widest”).
Merge: es + t becomes est.
"l o w _" : 5
"l o w e r _" : 2
"n e w est _" : 6
"w i d est _" : 3
Vocabulary: 13 tokens (added est)
Iteration 3: The most frequent pair is (est, _) with frequency 9.
Merge: est + _ becomes est_.
"l o w _" : 5
"l o w e r _" : 2
"n e w est_" : 6
"w i d est_" : 3
Vocabulary: 14 tokens (added est_)
Iteration 4: Now (l, o) and (o, w) both have frequency 7. Pick (l, o).
Merge: l + o becomes lo.
"lo w _" : 5
"lo w e r _" : 2
"n e w est_" : 6
"w i d est_" : 3
Vocabulary: 15 tokens (added lo)
Iteration 5: (lo, w) has frequency 7.
Merge: lo + w becomes low.
"low _" : 5
"low e r _" : 2
"n e w est_" : 6
"w i d est_" : 3
Vocabulary: 16 tokens (added low)
Iteration 6: (low, _) has frequency 5, while (n, e), (e, w), and (w, est_) tie at frequency 6. Among the three tied pairs, we pick (e, w).
Merge: e + w becomes ew.
"low _" : 5
"low e r _" : 2
"n ew est_" : 6
"w i d est_" : 3
Vocabulary: 17 tokens (added ew)
After just 6 iterations, observe what has happened. The suffix “-est” has been captured as a single token est_ — the algorithm discovered that this is a common English morphological pattern. The prefix “low” has been captured as a single token. The algorithm did not know anything about English morphology; it simply followed the statistics of character co-occurrence.
If we continued, the next merges would likely produce new from (n, ew), then newest_ from (new, est_). The algorithm would eventually discover that “low_” and “lower” share a common prefix, and might create tokens for common suffixes like “-er”, “-ing”, “-tion”.
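The training loop is short enough to sketch in full. Below is a minimal, unoptimized implementation with ties broken by first occurrence in corpus order, a convention that reproduces iterations 1-5 above (iteration 6's three-way tie may resolve differently under other tie-break rules):

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent token pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = pair[0] + pair[1]
    new_corpus = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_corpus[tuple(out)] = freq
    return new_corpus

def train_bpe(corpus, num_merges):
    """Greedy BPE training: repeatedly merge the most frequent pair.
    max() over a dict breaks ties by insertion (first-seen) order."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges

# The corpus from the worked example, words pre-split into characters + "_".
corpus = {
    tuple("low") + ("_",): 5,
    tuple("lower") + ("_",): 2,
    tuple("newest") + ("_",): 6,
    tuple("widest") + ("_",): 3,
}
merges = train_bpe(corpus, 5)
# First five merges: ('e','s'), ('es','t'), ('est','_'), ('l','o'), ('lo','w')
```

Production implementations differ mainly in data structures (incremental pair counts instead of full recounts), not in the algorithm itself.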
The Compression Mental Model
BPE is fundamentally a compression algorithm. Each merge reduces the total number of tokens in the corpus by replacing two tokens with one. The merge that reduces the corpus length the most is the one that replaces the most frequent pair — which is exactly what BPE selects.
After k merges, the vocabulary has grown by k entries and the corpus has been compressed by approximately f_1 + f_2 + … + f_k tokens, where f_i is the frequency of the pair merged at step i. The algorithm is greedy: it picks the locally optimal merge at each step without considering future consequences. This greedy strategy works remarkably well in practice because high-frequency pairs tend to remain useful as the vocabulary grows.
The vocabulary budget is the stopping condition. If you want a 32K vocabulary and your initial character set has 256 entries (byte-level BPE), you perform 32,000 − 256 = 31,744 merge operations. Each merge adds one token to the vocabulary and makes the corpus slightly shorter.
The key mental model: a BPE vocabulary of size V is the set of V substrings that gives the best compression of the training corpus under a greedy merging strategy. Frequent substrings get their own tokens (short encoding), rare substrings are decomposed into characters (long encoding). This is exactly analogous to Huffman coding — common patterns get short codes.
BPE Encoding (Inference Time)
Training produces a merge table — an ordered list of merge rules. To tokenize a new string at inference time:
- Split the string into characters (or bytes).
- Apply merge rules in priority order: find the highest-priority merge that can be applied, apply it, and repeat.
- Continue until no more merges are applicable.
def bpe_encode(text, merge_rules):
"""
Apply BPE encoding to a string given trained merge rules.
merge_rules: list of (pair, merged_token) in priority order
"""
# Start with character-level tokens
tokens = list(text)
for pair, merged in merge_rules:
i = 0
new_tokens = []
while i < len(tokens):
# Look for the pair at position i
if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i+1] == pair[1]:
new_tokens.append(merged)
i += 2 # Skip both tokens in the pair
else:
new_tokens.append(tokens[i])
i += 1
tokens = new_tokens
return tokens
This is O(N × M), where N is the string length and M is the number of merge rules. In practice, optimized implementations use hash tables and priority queues to bring this close to linear in the string length.
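Applying learned rules to a word that never appeared in the training corpus shows why subword decomposition is lossless. Here is a compact, self-contained restatement of the same loop, using the five merges from the worked example:

```python
def apply_merges(tokens, rules):
    """Apply ordered BPE merge rules (pair, merged_token) to a token list."""
    for pair, merged in rules:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# The five merges learned in the worked example above.
rules = [(("e", "s"), "es"), (("es", "t"), "est"), (("est", "_"), "est_"),
         (("l", "o"), "lo"), (("lo", "w"), "low")]

# "lowest" never appeared in the training corpus, yet it decomposes
# cleanly into two known subwords: "low" + "est_".
tokens = apply_merges(list("lowest") + ["_"], rules)
```

The model has seen both "low" and "est_" in other contexts, so an unseen word still maps to meaningful units rather than an [UNK] token.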
Part 3: SentencePiece vs tiktoken vs HuggingFace Tokenizers
Three implementations dominate the LLM ecosystem. They all implement BPE (and sometimes alternative algorithms), but differ in their pre-tokenization strategy, implementation language, and design philosophy.
SentencePiece
SentencePiece, developed by Google, is the most language-agnostic of the three. Its defining feature: it operates on raw Unicode strings without any language-specific pre-tokenization. Where other tokenizers split on whitespace first and then apply BPE within each word, SentencePiece treats the entire input — including spaces — as a flat sequence of characters. Spaces are represented as the Unicode character U+2581 (LOWER ONE EIGHTH BLOCK, rendered as ▁), making them explicit tokens that participate in BPE merges.
This design has a crucial advantage: it works identically for languages that do not use whitespace word boundaries (Chinese, Japanese, Thai). A whitespace-based pre-tokenizer would treat an entire Chinese sentence as a single “word”, making BPE within it behave very differently from English. SentencePiece avoids this asymmetry entirely.
SentencePiece also implements the Unigram model as an alternative to BPE. Instead of building a vocabulary bottom-up through merges, the Unigram model starts with a large initial vocabulary and iteratively removes tokens that contribute least to the corpus likelihood. The Unigram model optimizes a probabilistic objective (maximizing the likelihood of the training data under a unigram language model over subword units) and can find more globally optimal vocabularies than BPE’s greedy approach.
Used by: Llama 1, Llama 2, T5, ALBERT, XLNet, Mistral (early versions)
tiktoken
tiktoken is OpenAI’s tokenizer library, purpose-built for the GPT family of models. It implements byte-level BPE with regex-based pre-tokenization. Before BPE merges are applied, tiktoken uses a regex pattern to split the input into coarse chunks:
# Simplified version of GPT-4's pre-tokenization pattern
import regex
pattern = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
This regex splits text into words, numbers (in groups of up to 3 digits), punctuation, and whitespace. BPE is then applied independently within each chunk. The pre-tokenization prevents BPE from merging across word boundaries — the token “the” can never merge with a following space and the first letter of the next word.
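The production GPT-4 pattern requires the third-party `regex` module (for `\p` character classes and possessive quantifiers), but the idea can be sketched with the standard library. The simplified pattern below is an illustrative approximation, not the real rule:

```python
import re

# Stdlib-only stand-in for tiktoken-style pre-tokenization. Alternatives
# are tried in order: letter runs (with optional leading space), digit
# groups of at most 3, punctuation runs, then any remaining whitespace.
PRETOKEN = re.compile(
    r" ?[A-Za-z]+"         # optional leading space + run of letters
    r"|[0-9]{1,3}"         # digits in groups of up to 3
    r"| ?[^A-Za-z0-9\s]+"  # optional leading space + punctuation run
    r"|\s+"                # leftover whitespace
)

def pretokenize(text):
    """Split text into chunks; BPE merges never cross chunk boundaries."""
    return PRETOKEN.findall(text)

chunks = pretokenize("Hello world, take 12345 tokens!")
# The chunks cover the input exactly, so the split is lossless,
# and "12345" is pre-split into "123" + "45" before BPE ever runs.
```

Note how the digit rule alone already shapes arithmetic tokenization: five-digit numbers can never become single chunks under this pattern.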
tiktoken is implemented in Rust with Python bindings, making it extremely fast. Its byte-level approach means the base vocabulary is exactly 256 entries (one per byte value), ensuring that any byte sequence can be encoded — no [UNK] token is ever needed.
Used by: GPT-2, GPT-3, GPT-3.5, GPT-4, GPT-4o
HuggingFace Tokenizers
The HuggingFace tokenizers library is the Swiss Army knife of tokenization. It provides a unified API for BPE, WordPiece (used by BERT), Unigram, and character-level models. The core is written in Rust for speed, with bindings for Python and Node.js.
Its modular architecture separates the pipeline into distinct stages: normalization (Unicode NFKC, lowercasing), pre-tokenization (whitespace splitting, byte-level mapping), the core model (BPE, WordPiece, Unigram), and post-processing (adding special tokens, building attention masks). Each stage is independently configurable.
The library’s primary strength is flexibility and ecosystem integration. It can load tokenizers from any HuggingFace model repository, train new tokenizers on custom data, and convert between formats. For production inference, it provides the same Rust-powered speed as tiktoken.
Used by: Most open-source models via the HuggingFace ecosystem (Llama 3, Qwen, Falcon, MPT, etc.)
Tokenizer Implementation Comparison
| Feature | SentencePiece | tiktoken | HuggingFace Tokenizers |
|---|---|---|---|
| Language | C++ | Rust + Python | Rust + Python |
| Pre-tokenization | None (whitespace as token) | Regex-based | Configurable |
| Algorithms | BPE, Unigram | Byte-level BPE | BPE, WordPiece, Unigram |
| Base vocabulary | Unicode codepoints | 256 bytes | Configurable |
| CJK handling | Native (no whitespace bias) | Via regex rules | Configurable |
| Speed (English, 1MB) | ~120 ms | ~45 ms | ~55 ms |
| Ecosystem | Google models | OpenAI models | Most open-source models |
When to Use Each
Use SentencePiece when training a model from scratch on heavily multilingual data, especially if your corpus includes significant CJK text. Its language-agnostic design avoids English-centric biases. Also use it when you want to experiment with the Unigram algorithm.
Use tiktoken when working with OpenAI models or when you need the fastest possible encoding speed. Its byte-level design is clean and predictable. It is also the right choice if you are building a tokenizer for a new model and want the simplest, most robust design.
Use HuggingFace Tokenizers when working with the broader open-source ecosystem, when you need to load pre-trained tokenizers from model repositories, or when you need maximum configurability over the tokenization pipeline.
Part 4: Vocabulary Size Tradeoffs
The vocabulary size is one of the most consequential hyperparameters in LLM design, yet it receives far less attention than model depth, width, or learning rate. It controls a three-way tradeoff between sequence length, embedding table size, and multilingual coverage.
The Core Tradeoff
A larger vocabulary means more subword units, which means common sequences get their own dedicated tokens instead of being split into pieces. This produces shorter token sequences for the same text — better compression. But each vocabulary entry requires a row in the embedding table (d_model parameters) and a row in the output projection / language model head (another d_model parameters). The embedding table for a 128K vocabulary at d_model = 4096 is 128,000 × 4,096 ≈ 524 million parameters — about 7.5% of a 7B-parameter model.
Compression Ratio Across Languages
The most dramatic impact of vocabulary size is on multilingual performance. English text compresses well even with small vocabularies because English words are short and the Latin alphabet is compact. CJK languages (Chinese, Japanese, Korean) and languages with complex scripts (Arabic, Hindi, Thai) require many more bytes per character and have much larger character sets. A vocabulary trained primarily on English data will assign single tokens to common English words but split CJK characters into multiple byte-level tokens.
Tokens per 1,000 Characters by Vocabulary Size
| Language | 32K vocab (GPT-2) | 50K vocab (Llama 2) | 100K vocab (GPT-4) | 128K vocab (Llama 3) |
|---|---|---|---|---|
| English | ~280 tokens | ~265 tokens | ~240 tokens | ~230 tokens |
| Chinese | ~700 tokens | ~550 tokens | ~350 tokens | ~320 tokens |
| Japanese | ~650 tokens | ~520 tokens | ~330 tokens | ~300 tokens |
| Python code | ~350 tokens | ~320 tokens | ~270 tokens | ~250 tokens |
| Arabic | ~580 tokens | ~470 tokens | ~310 tokens | ~280 tokens |
The numbers are striking. With a 32K vocabulary, Chinese text requires 2.5x more tokens per character than English. This means a Chinese user processing a document of equivalent semantic content pays 2.5x more in attention cost and 2.5x more in API tokens-per-dollar. Expanding the vocabulary to 128K nearly eliminates this gap.
The Memory Cost
The flip side: bigger vocabularies mean bigger embedding tables. The embedding table and the language model head (which is often a tied copy of the embedding table, or an untied matrix of the same shape) together account for:
params(embedding + head) = k × V × d_model, where k = 2 if the LM head has separate weights (common in large models), and k = 1 if the weights are tied.
Embedding Table Size by Vocabulary Size (d_model = 4096, FP16)
| Vocab Size | Embedding Params | Memory (tied) | Memory (untied) | Share of 7B Model |
|---|---|---|---|---|
| 32,000 | 131M | 250 MB | 500 MB | 1.9% / 3.7% |
| 50,000 | 205M | 391 MB | 781 MB | 2.9% / 5.6% |
| 100,000 | 410M | 781 MB | 1,562 MB | 5.9% / 11.2% |
| 128,000 | 524M | 1,000 MB | 2,000 MB | 7.5% / 14.3% |
| 256,000 | 1,049M | 2,000 MB | 4,000 MB | 15.0% / 28.6% |
At 128K vocabulary, the embedding table is 1 GB (tied) or 2 GB (untied) in FP16. This is not catastrophic for a 7B model (14 GB total weights), but it is substantial — and it scales linearly with vocabulary size. A 256K vocabulary would consume 4 GB for the embedding parameters alone in the untied case.
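The table's arithmetic is straightforward to reproduce. A quick sketch, using 2 bytes per FP16 parameter:

```python
def embedding_footprint_bytes(vocab_size, d_model=4096, tied=True):
    """FP16 memory of the token embedding table; doubled when the
    LM head is a separate (untied) matrix of the same shape."""
    params = vocab_size * d_model
    copies = 1 if tied else 2
    return params * copies * 2  # 2 bytes per FP16 parameter

mem_tied = embedding_footprint_bytes(128_000)                 # ~1.05e9 bytes
mem_untied = embedding_footprint_bytes(128_000, tied=False)   # ~2.1e9 bytes
```

The linear scaling is visible immediately: doubling the vocabulary doubles this footprint, with no economies of scale.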
Compute Cost at the Boundaries
The embedding lookup itself is cheap — just a table lookup, linear in sequence length. But the language model head at the output layer is a full matrix multiplication: the hidden state of dimension d_model is multiplied by the d_model × V vocabulary projection matrix to produce logits for all V vocabulary entries. This costs roughly 2 × d_model × V FLOPs per forward pass.
For a single decode step (one token, batch size 1) with d_model = 4096:
- At V = 32,000: 2 × 4,096 × 32,000 ≈ 262 MFLOPs
- At V = 128,000: 2 × 4,096 × 128,000 ≈ 1,049 MFLOPs
The LM head at 128K vocabulary costs 4x more than at 32K. During decode, when every microsecond of latency matters, this is a measurable difference — though it is still much smaller than the total cost of all transformer layers.
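The decode-step numbers above come from a one-line formula, counting each multiply-accumulate as 2 FLOPs:

```python
def lm_head_mflops(d_model, vocab_size):
    """FLOPs to project one hidden state to vocabulary logits,
    counting a multiply-accumulate as 2 FLOPs."""
    return 2 * d_model * vocab_size / 1e6

small = lm_head_mflops(4096, 32_000)    # ~262 MFLOPs
large = lm_head_mflops(4096, 128_000)   # ~1,049 MFLOPs, exactly 4x small
```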
LM Head FLOPs per Decode Step by Vocabulary Size (d = 4096, MFLOPs)
The Optimal Vocabulary Size
There is no single optimal vocabulary size — the right choice depends on the model’s intended use. Here is the current consensus among practitioners:
- 32K is sufficient for English-dominant models with short context. It keeps embedding tables small and works well for English and code.
- 50-65K is a reasonable middle ground for bilingual (English + one other language) models.
- 100-128K is the current standard for multilingual models and models expected to handle diverse inputs (code, math, multiple languages). The compression gains for non-English text more than compensate for the larger embedding table.
- Beyond 128K shows diminishing returns. The marginal compression gain per additional vocabulary entry decreases as you add rarer and rarer subwords, while the embedding table cost grows linearly.
If your model will serve multilingual users or process code and structured data: use 100K+ vocabulary. The 15-30% reduction in token count for non-English text directly translates to 15-30% lower inference cost (fewer KV cache entries, fewer decode steps, fewer API tokens billed). The embedding table overhead is modest compared to the savings.
Part 5: Special Tokens
Every tokenizer vocabulary includes a set of special tokens — tokens that do not correspond to any text in the input but serve structural roles in the model’s input/output protocol. Understanding these tokens is essential for correct inference and fine-tuning.
The Core Special Tokens
BOS (Beginning of Sequence): Signals the start of a new input. The model’s first token is always BOS, which provides a consistent “starting state” for the first attention layer. Without BOS, the first token would have no context to attend to, and its representation would depend entirely on its position embedding.
EOS (End of Sequence): Signals that the model should stop generating. During autoregressive inference, the generation loop runs until the model outputs the EOS token (or a maximum length is reached). If the tokenizer lacks a proper EOS token, or if the model was not trained with one, the model may generate indefinitely.
PAD (Padding): Used to pad sequences to equal length within a batch. Padded positions are masked out in attention (set to before softmax) so they do not influence computation. PAD tokens are essential for batched inference and training but have no semantic meaning.
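The masking described above can be demonstrated with plain floats: padded positions are set to negative infinity before softmax, so they receive exactly zero weight. A minimal sketch:

```python
import math

scores = [2.0, 1.0, 0.5, 0.0]        # raw attention scores for 4 positions
is_pad = [False, False, True, True]  # last two positions are padding

masked = [-math.inf if pad else s for s, pad in zip(scores, is_pad)]
exps = [math.exp(s) for s in masked]  # exp(-inf) == 0.0
weights = [e / sum(exps) for e in exps]
# Padded positions contribute nothing; the real positions renormalize
# so that the weights still sum to 1.
```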
UNK (Unknown): A fallback token for inputs that cannot be encoded. Modern byte-level tokenizers (tiktoken, Llama 3’s tokenizer) do not need UNK because every byte is in the vocabulary. Older tokenizers that operate at the Unicode codepoint level may still produce UNK for very rare characters.
Chat Template Tokens
Modern instruction-following models use additional special tokens to delineate the roles within a conversation:
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant.
<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is BPE?
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
These tokens serve a critical function: they let the model distinguish between system instructions, user input, and its own output. During instruction tuning, the model learns that content after <|start_header_id|>system<|end_header_id|> should be treated as behavioral constraints, and content after the assistant header is where it should generate its response.
The exact token format varies by model family. Llama 3 uses the format shown above. ChatGPT uses <|im_start|> and <|im_end|>. Mistral uses [INST] and [/INST]. Getting these tokens wrong — for example, using ChatGPT’s template format with a Llama model — will produce garbage output because the model has never seen those token sequences during training.
Using the wrong chat template is one of the most common deployment errors. A model will not crash or throw an error — it will simply produce degraded output because the special token sequences do not match what it learned during instruction tuning. Always use the template specified in the model’s tokenizer configuration.
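For illustration only, here is what assembling the Llama 3 format shown above looks like as plain string manipulation. In practice you should never hand-roll this: use the `apply_chat_template` method bundled with the model's tokenizer, which reads the authoritative template from the tokenizer configuration.

```python
def format_llama3_chat(messages):
    """Hand-rolled sketch of the Llama 3 chat format, for illustration.
    Real code should call tokenizer.apply_chat_template instead."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    # The trailing assistant header cues the model to generate its reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = format_llama3_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is BPE?"},
])
```

A single wrong delimiter in a hand-rolled version produces token sequences the model never saw in training, which is exactly the silent failure described above.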
EOS and Inference Cost
The EOS token has a direct performance implication: it determines when generation stops. A model that does not reliably produce EOS will generate tokens until hitting the maximum length limit, wasting compute and producing trailing garbage. Conversely, a model that produces EOS too early will truncate its responses.
During beam search or sampling, EOS detection must be checked at every decode step. Some serving frameworks implement “stop sequences” — multi-token patterns that trigger early stopping — which require the tokenizer to check whether the most recent tokens match a stop pattern. This is a small but measurable overhead at each decode step.
Part 6: Tokenization Artifacts and Failure Modes
Tokenization is not a neutral preprocessing step. It introduces artifacts — systematic biases and failure modes that propagate through the entire model. Some of these are well-known; others are subtle and only manifest in specific contexts.
Glitch Tokens: The SolidGoldMagikarp Problem
In 2023, researchers discovered that certain tokens in GPT-2 and GPT-3’s vocabulary produced bizarre behavior when used as prompts. The most famous example was the token “ SolidGoldMagikarp” (note the leading space) — a Reddit username that appeared frequently enough in the training data to earn its own token during BPE training, but then appeared in very different contexts (or not at all) during the model’s language model training.
The mechanism is straightforward. BPE vocabulary training and language model training use different datasets (or the same dataset is processed differently). A token can appear frequently enough in the BPE training corpus to be assigned its own vocabulary entry, but then appear rarely or in unusual contexts during model training. The model’s embedding for this token is undertrained — it occupies a random region of embedding space and produces unpredictable outputs when encountered.
This failure mode is specific to the gap between tokenizer training and model training. If both use identical data and the token has sufficient frequency in both stages, the problem does not arise.
Token Boundary Effects on Arithmetic
GPT models famously struggle with arithmetic. A significant part of this difficulty comes from tokenization. Consider the number “12345”:
- It might be tokenized as "123" + "45" (two tokens)
- Or as "1234" + "5" (two tokens with a different split)
- Or as "12345" (one token, if it is in the vocabulary)
The model does not see the individual digits — it sees arbitrary substrings of digits, with boundaries that depend on what numbers appeared in the BPE training corpus. The number “380” might be a single token while “381” is split into “38” + “1”. This means the model must learn arithmetic not over digits but over inconsistent substrings of digits. The operation “380 + 1 = 381” requires the model to understand that adding 1 to the single token “380” produces the two tokens “38” + “1” — a fundamentally different representation.
How a human sees the problem:
Calculate: 1273 + 5684
The answer is 6957.
Each digit is clearly visible:
    1 2 7 3
  + 5 6 8 4
  ---------
    6 9 5 7
How the model sees it (token boundaries marked with |):
Calculate: |12|73| |+| |56|84|
The answer is |69|57|.
Token boundaries obscure digits:
    |12|73|
  + |56|84|
  ---------
    |69|57|
Character Counting Failures
“How many r’s are in ‘strawberry’?” is a famously difficult question for LLMs. The reason is tokenization: “strawberry” is typically tokenized as "str" + "aw" + "berry" or similar. The model never sees individual characters — it sees subword tokens. To count the occurrences of “r” in “strawberry”, the model would need to decompose its tokens back into characters, which is an operation it was never explicitly trained to perform. The model must implicitly learn the character composition of every token in its vocabulary — a task that is possible in principle but difficult in practice.
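The gap is easy to make concrete. Assuming the three-way split above ("str" + "aw" + "berry" is a plausible hypothetical; actual splits vary by tokenizer), counting characters is trivial for code that sees the string, but the model sees only token IDs and would need to have implicitly memorized each token's spelling:

```python
word = "strawberry"
tokens = ["str", "aw", "berry"]  # hypothetical split; varies by tokenizer

# Trivial with direct access to the characters:
direct_count = word.count("r")

# From tokens, the count is only recoverable if the spelling of every
# token is known — knowledge the model must learn implicitly, since it
# is never shown token spellings during training:
token_count = sum(t.count("r") for t in tokens)
```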
Multi-Byte UTF-8 and Token Boundaries
UTF-8 encodes Unicode characters using 1 to 4 bytes. ASCII characters use 1 byte, common accented characters use 2 bytes, CJK characters use 3 bytes, and emoji use 4 bytes. In byte-level BPE, each byte is a base vocabulary entry. A CJK character is initially represented as 3 separate byte tokens.
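Python's `encode` makes these byte counts concrete — each character below starts out as that many base tokens in byte-level BPE:

```python
# Expected UTF-8 byte lengths: ASCII, accented Latin, CJK, emoji.
samples = {"A": 1, "é": 2, "中": 3, "😀": 4}

for ch, expected in samples.items():
    n_bytes = len(ch.encode("utf-8"))
    # In byte-level BPE, this character begins as n_bytes separate tokens
    # until (and unless) BPE merges fuse them.
    assert n_bytes == expected
```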
If the BPE training corpus contains enough CJK text, these 3-byte sequences will frequently co-occur and merge into single tokens. But if the training corpus is English-heavy, the byte sequences for CJK characters may never merge completely, leaving CJK text represented as 2-3x more tokens than necessary.
This is the byte-level fallback’s hidden cost: it guarantees coverage (no UNK tokens) but the quality of coverage depends entirely on the BPE training corpus distribution. GPT-4’s 100K vocabulary includes many merged CJK byte sequences, but a 32K vocabulary trained on English-heavy data will represent Chinese text with far more tokens per character.
A tokenizer trained on English-heavy data can make Chinese or Arabic text 2-3x more expensive to process than English text of equivalent semantic content. This is not a model limitation — it is a tokenizer limitation. The fix is a larger vocabulary with more diverse training data.
Part 7: Performance — Tokenizer Throughput
Tokenization is a CPU operation. It happens before any GPU computation and after all GPU computation (for decoding tokens back to text). In most inference pipelines, it is fast enough that nobody thinks about it. But at extreme scale, tokenizer throughput can become relevant.
Pure Python vs Rust-backed Implementations
The performance gap between a naive Python tokenizer and a compiled implementation is enormous. A pure Python BPE implementation processes on the order of 10-50 KB/s. The Rust-backed implementations (tiktoken, HuggingFace tokenizers) process 10-100 MB/s, a speedup of roughly three orders of magnitude.
This gap exists because BPE encoding requires intensive string manipulation: scanning for pairs, performing replacements, rescanning. Python’s interpreter overhead and string immutability make this extremely slow. Rust’s zero-cost abstractions and mutable string handling eliminate this overhead almost entirely.
[Chart: Tokenizer Throughput (Single Thread, English Text), in MB/s]
Batched Tokenization
All three major implementations support batched tokenization — processing multiple strings in parallel. HuggingFace tokenizers can parallelize across CPU cores using its built-in parallelism:
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("meta-llama/Llama-3-8B")

# Single string - uses one core
encoded = tokenizer.encode("Hello world")

# Batch of strings - automatically parallelized across cores
batch_encoded = tokenizer.encode_batch([
    "First document...",
    "Second document...",
    "Third document...",
    # ... thousands of documents
])
```
With 8 CPU cores, batched tokenization achieves roughly 6-7x the single-thread throughput (not a perfect 8x due to overhead and memory contention).
Pre-Tokenization as a Parallelism Opportunity
The regex-based pre-tokenization step (splitting text into words/chunks before applying BPE) is embarrassingly parallel. Each chunk can be BPE-encoded independently because BPE merges never cross chunk boundaries. This is why tiktoken’s regex-first design is particularly fast: the regex split produces many small independent chunks that can be processed in parallel, while the BPE encoding of each chunk is a small, cache-friendly operation.
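The chunk-and-parallelize structure can be sketched in a few lines. Here `encode_chunk` is a hypothetical stand-in for per-chunk BPE, and the regex is far simpler than real pre-tokenization patterns; real implementations do this in Rust threads, so Python's `ThreadPoolExecutor` is shown only to illustrate the shape of the pipeline:

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Simplified pre-tokenization: split into whitespace-prefixed chunks.
# Real pre-tokenizers use much richer patterns (contractions, digits, etc.).
PRETOKEN_RE = re.compile(r"\s*\S+")

def encode_chunk(chunk: str) -> list[str]:
    # Stand-in for per-chunk BPE encoding. Because merges never cross
    # chunk boundaries, each chunk can be encoded independently.
    return [chunk]

def encode(text: str) -> list[str]:
    chunks = PRETOKEN_RE.findall(text)
    with ThreadPoolExecutor() as pool:           # GIL-bound in Python;
        results = pool.map(encode_chunk, chunks)  # Rust threads run truly parallel
    return [tok for toks in results for tok in toks]

print(encode("Hello brave new world"))
# ['Hello', ' brave', ' new', ' world']
```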
When Throughput Matters
For the vast majority of applications, tokenizer throughput is irrelevant. At 80 MB/s, tiktoken can tokenize a 1 MB document in 12.5 milliseconds — far less than the time the GPU spends processing the resulting tokens. Even for a 100 MB corpus, tokenization takes about 1.25 seconds single-threaded.
Tokenizer throughput becomes relevant in two scenarios:
Offline data preprocessing. When tokenizing terabytes of training data for LLM pre-training, the total tokenization time can be hours or days. Using a fast Rust-backed tokenizer instead of a Python implementation saves significant wall-clock time. The Llama 3 training corpus of ~15 trillion tokens corresponds to tens of terabytes of raw text; even at 80 MB/s, that is more than a week of single-core time.
Ultra-high-throughput serving. If your inference server handles thousands of requests per second, the CPU time for tokenization adds up. At 10,000 requests/s with an average 500 bytes per request, you need 5 MB/s of sustained tokenization throughput — easily achievable with any compiled implementation, but potentially problematic with pure Python.
If your inference pipeline is slow, the tokenizer is almost certainly not the bottleneck. Profile first. The GPU time for a single forward pass through a 7B-parameter model is 10-100 ms; tokenizing the input takes under 1 ms. The tokenizer is 100-1000x faster than the model.
Part 8: The Connection to Inference Cost
Tokenization directly controls inference cost through two mechanisms: the number of tokens in the input (which determines prompt processing cost) and the number of tokens in the output (which determines generation cost). Understanding this connection is critical for capacity planning and cost optimization.
Fewer Tokens = Faster Inference = Lower Cost
The cost of processing a prompt is dominated by the attention computation, which scales quadratically in the number of tokens. The cost of generating each output token is dominated by KV cache reads, which scale linearly in the current sequence length. In both cases, fewer tokens means less work.
A tokenizer that produces 20% fewer tokens for the same text reduces:
- Prefill compute by roughly 36% (0.8² = 0.64, so a 36% reduction)
- KV cache memory by 20% (linear in token count)
- Generation cost by 20% (fewer decode steps to produce the same text, fewer KV entries to read per step)
- API cost by 20% (most APIs bill per token)
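The bullet-point arithmetic follows directly from the scaling behavior; a quick sanity check, where r is the token-count ratio:

```python
# Savings from a 20% token-count reduction (ratio r = 0.8).
r = 0.8
prefill_saving = 1 - r ** 2      # attention prefill scales ~quadratically
kv_saving = 1 - r                # KV cache memory is linear in token count
extra_concurrency = 1 / r - 1    # requests fitting in a fixed KV cache budget

print(f"{prefill_saving:.0%} less prefill compute")         # 36%
print(f"{kv_saving:.0%} less KV cache memory")              # 20%
print(f"{extra_concurrency:.0%} more concurrent requests")  # 25%
```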
This is why Llama 3’s move from a 32K to a 128K vocabulary was significant. For Chinese text, the token count dropped by roughly 50% relative to Llama 1’s 32K tokenizer. That is a 75% reduction in prefill attention cost (0.5² = 0.25) and a 50% reduction in generation cost — for the same text, with the same model quality.
[Chart: Token Reduction Impact on Inference Cost (Llama 3 vs Llama 1 Tokenizer), as % of Llama 1 cost]
The Embedding Table Tradeoff
A larger vocabulary is not free. The embedding table and LM head must be loaded from HBM at least once per forward pass. During autoregressive decoding, the LM head computation (a d_model × V matrix-vector multiply, where V is the vocabulary size) happens at every single decode step.
For a 128K vocabulary with d_model = 4096:
- Embedding table: 128K × 4,096 × 2 bytes ≈ 1,000 MB in FP16
- LM head weight read per decode step: ≈ 1,000 MB
- At 2 TB/s HBM bandwidth: ~0.5 ms per decode step just for the LM head
Compare to a 32K vocabulary:
- LM head weight read: 250 MB per step
- At 2 TB/s: 0.125 ms per step
The difference is 0.375 ms per decode step. For a 500-token generation, that is 187.5 ms of additional latency — noticeable but not catastrophic. And this cost is fully offset if the 128K vocabulary produces even a modest reduction in the number of decode steps needed.
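The LM head arithmetic above can be checked in a few lines, assuming d_model = 4096 and 2 TB/s of HBM bandwidth as in the text (`lm_head_ms` is a hypothetical helper, not a library function). The exact figures differ slightly from the rounded 1,000 MB / 250 MB numbers because the code does not round the weight sizes:

```python
# Back-of-envelope LM head read cost per decode step (FP16 = 2 bytes/weight).
def lm_head_ms(vocab_size, d_model=4096, hbm_bytes_per_s=2e12):
    weight_bytes = vocab_size * d_model * 2        # LM head matrix in FP16
    return weight_bytes / hbm_bytes_per_s * 1e3    # seconds -> milliseconds

big = lm_head_ms(128_000)    # ~0.52 ms per step
small = lm_head_ms(32_000)   # ~0.13 ms per step
print(big - small)           # ~0.39 ms extra per decode step
print((big - small) * 500)   # ~197 ms over a 500-token generation
```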
KV Cache: Where the Real Savings Are
The largest inference cost saving from better tokenization comes through the KV cache. Every token in the sequence occupies KV cache memory in every layer of the model:
For a model with 32 layers and d_model = 4096 in FP16, full multi-head attention stores K and V at every layer: 2 × 32 × 4,096 × 2 bytes = 512 KB per token per request. (Llama 3 8B’s grouped-query attention, with 8 KV heads instead of 32, cuts this to 128 KB; the proportional savings are the same either way.)
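The per-token figure, and its grouped-query variant, can be computed directly (`kv_kb_per_token` is an illustrative helper):

```python
# KV cache per token: 2 (K and V) x layers x KV heads x head_dim x 2 bytes (FP16).
def kv_kb_per_token(n_layers, n_kv_heads, head_dim):
    return 2 * n_layers * n_kv_heads * head_dim * 2 / 1024

print(kv_kb_per_token(32, 32, 128))  # full MHA at d_model = 4096: 512.0 KB
print(kv_kb_per_token(32, 8, 128))   # GQA with 8 KV heads: 128.0 KB
```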
A 20% reduction in token count saves 20% KV cache memory. In a serving system with limited GPU memory, this means:
- 25% more concurrent requests (more requests fit in the same KV cache budget)
- 25% higher throughput
- 20% lower per-request latency (less KV cache to read during each decode step)
At scale, this is enormous. A serving cluster handling 10,000 concurrent requests with an average context length of 4,000 tokens consumes 10,000 × 4,000 × 512 KB ≈ 19.5 TB of KV cache memory. A 20% reduction saves roughly 3.9 TB — enough to serve about 2,500 additional concurrent requests, or to reduce the cluster size by 20%.
Tokenizer Impact on Serving Economics
| Metric | 32K Vocab | 128K Vocab | Savings |
|---|---|---|---|
| Tokens per 1K chars (Chinese) | ~700 | ~320 | 54% fewer tokens |
| KV cache per request (4K context) | 2.0 GB | 0.9 GB | 55% less memory |
| Concurrent requests (80GB GPU) | 40 | 88 | 2.2x more requests |
| Embedding table memory | 250 MB | 1,000 MB | +750 MB overhead |
| LM head latency per step | 0.125 ms | 0.5 ms | +0.375 ms per step |
| Net throughput gain (Chinese) | -- | -- | +80% to +120% |
The Bottom Line
Tokenization is not a preprocessing detail — it is a core economic parameter of LLM deployment. The tokenizer determines how many tokens your users consume, which directly determines your compute cost, memory footprint, and serving capacity. When evaluating models for production deployment, the tokenizer’s compression ratio on your specific data distribution should be part of the evaluation criteria alongside perplexity, accuracy, and latency.
Conclusion
Tokenization sits at the boundary between human-readable text and the numerical representations that transformers operate on. It is the first transformation in the pipeline and it shapes everything downstream.
The key insights from this deep dive:
Subword tokenization is a compression problem. BPE finds the most compressible representation of the training corpus within a vocabulary budget. Common sequences get short encodings (single tokens); rare sequences get long encodings (multiple tokens). The vocabulary budget controls the tradeoff between sequence length and embedding table size.
Vocabulary size is a first-order design decision. The difference between 32K and 128K vocabulary is not academic — it determines whether Chinese text costs 2x or 1x what English text costs to process. For multilingual or code-heavy workloads, a large vocabulary pays for itself many times over through reduced inference cost.
Tokenization has failure modes. Glitch tokens, arithmetic difficulties, character-counting failures, and the multilingual tax are all direct consequences of how the tokenizer splits text. These are not model failures — they are tokenizer failures that the model cannot overcome.
The tokenizer-inference cost connection is quantitative and significant. A 20% reduction in token count translates to 36% less prefill compute, 20% less KV cache memory, and 25% more concurrent requests. At serving scale, this is worth millions of dollars per year.
In Part 3 of this series, we will follow the tokens as they pass through the embedding layer and positional encoding — the mechanism that gives the transformer its sense of position within a sequence. We will see how token IDs become dense vectors, why positional information cannot be baked into the embeddings alone, and how RoPE and ALiBi changed the game for long-context models.
Part 1: The Transformer Attention Mechanism — How attention works, its cost, prefill vs decode profiles. Part 2: Tokenization and BPE (this post) — How text becomes tokens, vocabulary tradeoffs, the cost connection. Part 3: Embeddings and Positional Encoding (coming next) — From token IDs to dense vectors, RoPE, ALiBi.