Long-context model
- Definition
- An AI model capable of processing extremely long inputs, typically 100K to 2M+ tokens in a single request. Long-context models can ingest entire books, codebases, or document collections without chunking.
- Why it matters
- Long-context models change what is architecturally possible. Before long context, you had to chunk documents, build retrieval pipelines, and hope the right chunks were retrieved. With a million-token context window, you can dump an entire codebase, legal contract, or research corpus into a single prompt. This simplifies architectures, reduces engineering complexity, and often improves quality by letting the model see the full picture. But long context is not free: longer inputs cost more, and models still struggle with the 'lost in the middle' problem where information in the middle of long contexts gets less attention than information at the beginning or end.
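The cost point above can be made concrete with a back-of-the-envelope calculation. The price per million tokens below is an illustrative assumption, not a real vendor rate:

```python
# Rough input-cost comparison: full-context prompt vs. retrieved chunks.
# PRICE_PER_MTOK_INPUT is an assumed rate for illustration only.

PRICE_PER_MTOK_INPUT = 1.25  # assumed $/million input tokens

def prompt_cost(num_tokens: int, price_per_mtok: float = PRICE_PER_MTOK_INPUT) -> float:
    """Dollar cost of the input side of a single request."""
    return num_tokens / 1_000_000 * price_per_mtok

# Feeding a whole 800K-token codebase vs. 8K tokens of retrieved chunks:
full_context = prompt_cost(800_000)   # $1.00 at the assumed rate
rag_chunks = prompt_cost(8_000)       # $0.01 at the assumed rate
print(f"full context: ${full_context:.2f}, retrieved chunks: ${rag_chunks:.2f}")
```

At any realistic price, sending the full corpus on every request costs orders of magnitude more than sending a few retrieved chunks, which is why cost, not capability, often decides between the two architectures.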
- In practice
- Google's Gemini 1.5 Pro was the first commercial model to offer a 1M-token context window, later expanded to 2M tokens. Anthropic's Claude supports a 200K-token window as standard. Magic AI raised $320M to build a 100M-token context model for code understanding. Enterprise use cases include processing entire codebases for migration planning, analyzing complete legal discovery document sets, reviewing quarter-end financial filings, and ingesting full research paper corpora. 'Needle in a Haystack' tests show that leading models can reliably retrieve specific facts from contexts of 200K+ tokens, though accuracy varies with where the fact sits in the context.
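The 'Needle in a Haystack' setup is simple to sketch: plant a fact at a chosen depth in filler text, send the result to a model, and check whether the answer contains the fact. This is a minimal harness with the model call left out; the needle, filler, and scoring rule are illustrative:

```python
# Minimal 'Needle in a Haystack' harness sketch. A real run would send
# `haystack` plus a question to an LLM API and score the response.

def build_haystack(needle: str, filler: str, total_chars: int, depth: float) -> str:
    """Embed `needle` at fractional `depth` (0.0 = start, 1.0 = end) of filler text."""
    body = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(len(body) * depth)
    return body[:pos] + "\n" + needle + "\n" + body[pos:]

def contains_needle(answer: str, expected: str) -> bool:
    """Crude pass/fail: did the model's answer reproduce the planted fact?"""
    return expected.lower() in answer.lower()

haystack = build_haystack(
    needle="The secret launch code is 7421.",
    filler="The quick brown fox jumps over the lazy dog. ",
    total_chars=2_000,
    depth=0.5,  # mid-context placement, where recall tends to dip
)
```

Sweeping `depth` from 0.0 to 1.0 and plotting the pass rate is exactly how the position-dependent 'lost in the middle' effect is measured.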
We cover models & architecture every week.
Get the 5 AI stories that matter — free, every Friday.
Related terms
Context window
The maximum number of tokens a model can process in a single request, including both the prompt and the response. Larger context windows (100K-2M tokens) let models ingest entire codebases or documents at once.
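Because the window covers prompt and response together, a request must reserve room for the output. A pre-flight check might look like this sketch, with an assumed 200K window and token counts taken from the provider's tokenizer:

```python
# Sketch of a pre-flight context budget check: the prompt plus the
# reserved response budget must fit inside the window. The 200K default
# is an assumption for illustration.

def fits_context(prompt_tokens: int, max_response_tokens: int,
                 context_window: int = 200_000) -> bool:
    """True if the request leaves room for the full response inside the window."""
    return prompt_tokens + max_response_tokens <= context_window

assert fits_context(150_000, 4_000)       # plenty of headroom
assert not fits_context(198_000, 4_000)   # response would be truncated
```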
KV cache
A memory structure that stores the key and value matrices from previous attention computations during autoregressive generation, avoiding redundant recalculation as each new token is produced. KV caching is essential for efficient inference.
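The idea reduces to: per generation step, append the new token's key and value rather than recomputing them for the whole prefix. A toy single-head sketch with made-up shapes:

```python
import numpy as np

# Toy KV cache: each generation step appends one key/value row instead of
# recomputing K and V for the entire prefix. Shapes are illustrative.

d = 4  # head dimension (assumed)

class KVCache:
    def __init__(self):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def attend(q: np.ndarray, cache: KVCache) -> np.ndarray:
    """Scaled dot-product attention of one new query over all cached keys/values."""
    scores = cache.keys @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache.values

rng = np.random.default_rng(0)
cache = KVCache()
for _ in range(3):  # three generation steps, one K/V row cached per step
    cache.append(rng.normal(size=d), rng.normal(size=d))
out = attend(rng.normal(size=d), cache)
```

Note the trade-off this makes visible: the cache grows linearly with context length, so serving a 1M-token context means holding a very large KV cache in accelerator memory.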
Attention mechanism
The core innovation inside transformers that lets a model weigh the relevance of every token against every other token in a sequence. Attention is what makes modern LLMs understand context and long-range dependencies.
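The "every token against every other token" weighing is scaled dot-product attention, which can be sketched in a few lines of NumPy (dimensions here are arbitrary):

```python
import numpy as np

# Minimal scaled dot-product attention over a whole sequence: the
# (seq, seq) weight matrix holds the relevance of every token to every other.

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (seq, seq) pairwise relevance
    return weights @ V                          # weighted mix of value vectors

rng = np.random.default_rng(1)
seq, d_k = 5, 8
Q, K, V = (rng.normal(size=(seq, d_k)) for _ in range(3))
out = attention(Q, K, V)
```

Because the weight matrix is (seq, seq), compute and memory grow quadratically with sequence length, which is one reason very long contexts are expensive to serve.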
RAG (Retrieval-Augmented Generation)
A technique that retrieves relevant documents from an external knowledge base and feeds them to a model alongside the user's query. RAG reduces hallucination and keeps responses grounded in current, factual data.
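The retrieve-then-prompt flow can be sketched end to end. Real systems use dense embeddings and a vector index; the word-overlap scorer and sample documents here are stand-ins to show the shape of the pipeline:

```python
# Minimal RAG retrieval sketch: score documents against the query, take the
# top k, and prepend them to the prompt. Word overlap stands in for a real
# embedding-based retriever.

def score(query: str, doc: str) -> int:
    """Crude relevance: count of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

docs = [
    "Gemini 1.5 Pro offers a 1M-token context window.",
    "KV caching avoids recomputing attention for the prefix.",
    "RAG retrieves relevant documents before generation.",
]
query = "what is the Gemini context window"
hits = retrieve(query, docs)
prompt = "Context:\n" + "\n".join(hits) + "\n\nQuestion: " + query
```

Grounding the model in `hits` rather than the full corpus is the classic alternative to long context: cheaper per request, but dependent on the retriever surfacing the right passages.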