Context window
- Definition: The maximum number of tokens a model can process in a single request, including both the prompt and the response. Larger context windows (100K-2M tokens) let models ingest entire codebases or documents at once.
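Because the window covers prompt and response together, a practical first check is whether a prompt leaves room for the reply. A minimal sketch, using the rough heuristic of ~1 token per 0.75 English words rather than a real tokenizer (the function names and the 1,024-token response budget are illustrative):

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count: roughly 1 token per 0.75 English words."""
    return round(len(text.split()) / 0.75)

def fits_in_context(prompt: str, context_window: int, response_budget: int = 1024) -> bool:
    """True if the prompt still leaves room for the reserved response tokens."""
    return estimate_tokens(prompt) + response_budget <= context_window

prompt = "Summarize the attached design document in three bullet points."
print(fits_in_context(prompt, context_window=4096))  # prints True: a short prompt fits easily
```

For production use, a provider's actual tokenizer gives exact counts; the heuristic above is only for ballpark estimates.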
- Why it matters: Context window size determines what your AI can see and reason about in one shot. A 4K-token window limits you to a few pages; a 1M-token window can hold entire codebases, legal documents, or research corpora. But bigger is not always better: model quality degrades when the context is packed with irrelevant information (the 'lost in the middle' problem), and longer contexts cost more in both compute and money. The strategic question is not just how much context a model supports but how well it uses it: a model that reliably retrieves and reasons over 200K tokens beats one that theoretically supports 1M but misses key details.
- In practice: Google's Gemini 1.5 Pro launched with a 1M-token context window in early 2024 and later expanded to 2M tokens. Anthropic's Claude offers 200K tokens standard, with extended context up to 1M for enterprise customers. OpenAI's GPT-4 Turbo moved from 8K to 128K tokens. Most enterprise applications use 10-50K tokens per request, combining system prompts, retrieved documents, and conversation history. The 'Needle in a Haystack' test became the standard benchmark for measuring how well models actually use their full context window, revealing significant quality differences between providers.
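The core of a Needle-in-a-Haystack run is mechanical: bury one distinctive fact at a controlled depth inside filler text and ask the model to retrieve it. A sketch of the prompt construction (the filler sentence, needle, and depth parameter are illustrative, not the benchmark's actual data):

```python
def build_haystack(needle: str, filler_sentence: str, total_sentences: int, depth: float) -> str:
    """Insert the needle at a fractional depth: 0.0 = start of context, 1.0 = end."""
    sentences = [filler_sentence] * total_sentences
    position = int(depth * total_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences)

needle = "The secret launch code is AZURE-42."
haystack = build_haystack(needle, "The sky was a flat and uneventful gray.", 1000, depth=0.5)
prompt = haystack + "\n\nWhat is the secret launch code?"
```

Sweeping `depth` from 0.0 to 1.0 while growing `total_sentences` toward the window limit is what exposes the 'lost in the middle' effect: retrieval accuracy often dips when the needle sits mid-context.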
Related terms
Token
The basic unit of text that AI models process, roughly equivalent to 3/4 of a word in English. Tokens are how models read, price, and limit input and output, making token efficiency a key cost lever.
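Since tokens are the unit models price by, per-request cost is simple arithmetic over input and output token counts. A sketch with hypothetical per-million-token prices (the figures are illustrative, not any provider's actual rates):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one request, given per-million-token prices."""
    return (input_tokens / 1_000_000 * price_in_per_m
            + output_tokens / 1_000_000 * price_out_per_m)

# Hypothetical rates: $3 per million input tokens, $15 per million output tokens.
print(request_cost(10_000, 2_000, 3.0, 15.0))  # prints 0.06
```

Output tokens are typically priced several times higher than input tokens, which is why trimming verbose responses is often a bigger cost lever than trimming prompts.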
Long-context model
An AI model capable of processing extremely long inputs, typically 100K to 2M+ tokens in a single request. Long-context models can ingest entire books, codebases, or document collections without chunking.
KV cache
A memory structure that stores the key and value matrices from previous attention computations during autoregressive generation, avoiding redundant recalculation as each new token is produced. KV caching is essential for efficient inference.
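The mechanism can be shown in a few lines: each decoding step appends one new key/value pair to the cache and attends over everything cached so far, instead of recomputing keys and values for the whole prefix. A single-head, dependency-free sketch (all names and dimensions are illustrative):

```python
import math

HEAD_DIM = 4
k_cache = []  # cached keys, one entry per past token
v_cache = []  # cached values, one entry per past token

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q, k_new, v_new):
    """Append the new token's key/value to the cache, then attend over
    every cached position. Old keys/values are reused as-is, so each
    step only does the work for the newest token."""
    k_cache.append(k_new)
    v_cache.append(v_new)
    scores = [dot(q, k) / math.sqrt(HEAD_DIM) for k in k_cache]
    m = max(scores)                          # stabilized softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, v_cache))
            for i in range(HEAD_DIM)]

# Three decoding steps: the cache grows by one entry per generated token.
for step in range(3):
    out = attend([1.0, 0.0, 0.0, 0.0],
                 [float(step)] * HEAD_DIM,
                 [float(step)] * HEAD_DIM)
```

The trade-off is memory: the cache grows linearly with context length, which is a major reason long context windows are expensive to serve.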
Context engineering
The practice of strategically designing and managing the full context that is fed to an AI model, including system prompts, retrieved documents, conversation history, tool outputs, and structured metadata, to maximize response quality.
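One common pattern in context engineering is filling a token budget in priority order. A greedy sketch, assuming a caller-supplied token counter (the function names, priority order, and budget are illustrative):

```python
def assemble_context(system_prompt, history, retrieved_docs, budget, count_tokens):
    """Spend the token budget in priority order: system prompt first,
    then the most recent conversation turns, then retrieved documents,
    stopping as soon as the next piece would overflow the budget."""
    parts = [system_prompt]
    used = count_tokens(system_prompt)
    for turn in reversed(history):       # newest turns get priority
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        parts.insert(1, turn)            # re-inserted in chronological order
        used += cost
    for doc in retrieved_docs:
        cost = count_tokens(doc)
        if used + cost > budget:
            break
        parts.append(doc)
        used += cost
    return "\n\n".join(parts)

word_count = lambda text: len(text.split())  # crude stand-in for a real tokenizer
context = assemble_context(
    "You are a helpful assistant.",
    ["user: hello", "assistant: hi there", "user: what is a token?"],
    ["doc: A token is a unit of text."],
    budget=50,
    count_tokens=word_count,
)
```

The priority order here is a design choice, not a rule: retrieval-heavy applications often budget documents ahead of older conversation history.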