Tokenizer
- Definition
- The algorithm that splits text into tokens before a model can process it. Different models use different tokenizers, which affects how efficiently they handle various languages, code, and specialized content.
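The splitting step can be illustrated with a toy greedy longest-match scheme. This is only a sketch: the vocabulary below is invented for illustration, and production tokenizers (BPE, SentencePiece) learn their vocabularies and merge rules from a training corpus rather than matching whole substrings against a hand-written set.

```python
# Toy tokenizer: greedy longest-match against a fixed vocabulary.
# The vocabulary here is invented; real tokenizers learn theirs from data.
VOCAB = {"token", "izer", "ization", "ing", "the", " "}

def tokenize(text: str, vocab: set[str]) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first, shrinking the window.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("tokenization", VOCAB))  # ['token', 'ization']
```

Note how the same word can split differently depending on what the vocabulary happens to contain, which is exactly why different models report different token counts for identical text.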
- Why it matters
- Tokenizer choice has downstream effects that most people overlook. A bad tokenizer wastes tokens (and therefore money) on common patterns, handles non-English languages inefficiently, and can even affect model quality. Tokenizer design decisions made during pre-training are permanent: you cannot change a model's tokenizer after training without retraining from scratch. For multilingual applications, tokenizer efficiency varies dramatically: a tokenizer optimized for English might use 2-3x more tokens for Chinese or Arabic text, meaning those languages cost 2-3x more to process. This has real implications for global AI deployment and pricing fairness.
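One concrete way to see why some scripts start at a disadvantage: byte-level tokenizers operate on UTF-8 bytes, and non-Latin scripts use more bytes per character. Byte length is only a proxy (actual token counts depend on the tokenizer's learned vocabulary), and the sample strings below are arbitrary, but the imbalance is visible even at this level.

```python
# Compare character vs. UTF-8 byte counts for a greeting in two scripts.
# A byte-level tokenizer starts from the byte sequence, so more bytes per
# character means more raw material to compress into tokens.
samples = {
    "English": "Hello, how are you?",
    "Chinese": "你好，你好吗？",
}
for lang, text in samples.items():
    chars = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    print(f"{lang}: {chars} chars -> {utf8_bytes} UTF-8 bytes")
# English is 1 byte per character here; each CJK character takes 3 bytes.
```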
- In practice
- GPT-4o's tokenizer uses approximately 25% fewer tokens than GPT-4's for the same English text, and the improvement is even larger for non-English languages. This was achieved by training the tokenizer on a more multilingual corpus with a larger vocabulary (200K vs. 100K tokens). Llama 2 uses SentencePiece with a 32K vocabulary, while Llama 3 moved to a 128K-token BPE vocabulary. Claude uses its own tokenizer, optimized for both code and natural language. For developers, tokenizer differences mean that token counts are not directly comparable across models: 1,000 tokens of text under GPT-4's tokenizer might be roughly 750 tokens under GPT-4o's. Most API providers offer tokenizer libraries (tiktoken for OpenAI, sentencepiece for Llama) so developers can count tokens and estimate costs before making API calls.
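For a quick pre-call estimate without pulling in a tokenizer library, a characters-per-token heuristic can stand in. This is a rough sketch: the 4-characters-per-token ratio holds only loosely for English prose (non-English text and code can deviate a lot), and the price below is hypothetical; for exact counts, use the provider's own tokenizer library, such as tiktoken.

```python
def estimate_cost_usd(text: str, usd_per_million_tokens: float,
                      chars_per_token: float = 4.0) -> float:
    """Rough API cost estimate using a chars-per-token heuristic.

    ~4 chars/token holds only loosely for English; for exact counts,
    use the provider's tokenizer (e.g. tiktoken for OpenAI models).
    """
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens / 1_000_000 * usd_per_million_tokens

# Hypothetical price of $2.50 per million input tokens.
prompt = "Summarize the following report. " * 100
print(f"~${estimate_cost_usd(prompt, 2.50):.6f}")
```

A heuristic like this is fine for budgeting and rate-limit planning; for billing-accurate numbers, always count with the model's actual tokenizer, since the ratio shifts with language and content type.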
Related terms
Token
The basic unit of text that AI models process, roughly equivalent to 3/4 of a word in English. Tokens are how models read, price, and limit input and output, making token efficiency a key cost lever.
Token pricing
The cost model used by AI API providers, charging per million input and output tokens (output tokens typically cost more). Prices have fallen dramatically, from $60 per million output tokens (GPT-4, 2023) to under $1 per million for many models in 2026.
LLM (Large Language Model)
A neural network trained on massive text corpora to predict and generate language. LLMs like GPT-4, Claude, and Gemini are the foundation of the current AI wave, powering chatbots, coding tools, and enterprise automation.