Models & Architecture Deep Dive

Tokenizer

Definition
The algorithm that splits text into tokens before a model can process it. Different models use different tokenizers, which affects how efficiently they handle various languages, code, and specialized content.
Why it matters
Tokenizer choice has downstream effects that most people overlook. A bad tokenizer wastes tokens (and therefore money) on common patterns, handles non-English languages inefficiently, and can even affect model quality. Tokenizer design decisions made during pre-training are permanent: you cannot change a model's tokenizer after training without retraining from scratch. For multilingual applications, tokenizer efficiency varies dramatically: a tokenizer optimized for English might use 2-3x more tokens for Chinese or Arabic text, meaning those languages cost 2-3x more to process. This has real implications for global AI deployment and pricing fairness.
In practice
GPT-4o's tokenizer uses approximately 25% fewer tokens than GPT-4's for the same English text, and the improvement is even larger for non-English languages. This was achieved by training the tokenizer on a more multilingual corpus with a larger vocabulary (roughly 200K vs. 100K tokens). Early Llama models use a SentencePiece tokenizer with a 32K vocabulary (Llama 3 moved to a much larger, roughly 128K-token vocabulary). Claude uses its own tokenizer optimized for code and natural language. For developers, tokenizer differences mean that token counts are not directly comparable across models: text that measures 1,000 tokens under GPT-4's tokenizer might be only around 750 tokens under GPT-4o's. Most API providers offer tokenizer libraries (tiktoken for OpenAI, sentencepiece for Llama) so developers can estimate costs before making API calls.
