Autoregressive model
- Definition
- A model that generates output one token at a time, with each new token conditioned on all previous tokens. GPT, Claude, and Gemini are all autoregressive, which is why they stream responses word by word.
- Why it matters
- The autoregressive architecture is both the strength and limitation of current LLMs. Its strength: each token is generated with full context of everything before it, producing coherent long-form output. Its limitation: generation is inherently sequential, making inference time scale roughly linearly with output length. This is why a 10,000-token response takes about 10x longer than a 1,000-token response, and why speculative decoding and parallel generation techniques are hot research areas. For builders, understanding autoregressive generation explains why streaming matters, why output length affects cost, and why certain tasks (like editing a paragraph in place) are awkward for current models.
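The sequential loop described above can be sketched in a few lines. This is a toy illustration, not any real model's API: `toy_model` is a hypothetical stand-in for a forward pass that returns a next-token distribution over a 10-token vocabulary, conditioned on the full context.

```python
import random

def toy_model(context):
    # Hypothetical stand-in for an LLM forward pass: returns a
    # next-token probability distribution conditioned on ALL prior tokens.
    rng = random.Random(sum(context))  # deterministic toy dependence on context
    weights = [rng.random() for _ in range(10)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(prompt, n_tokens):
    tokens = list(prompt)
    for _ in range(n_tokens):
        probs = toy_model(tokens)           # one full forward pass per token
        next_tok = probs.index(max(probs))  # greedy decoding for simplicity
        tokens.append(next_tok)             # next step conditions on this token
    return tokens
```

The key point is structural: the loop body cannot be parallelized, because each iteration's input includes the previous iteration's output. That dependency is what makes generation time grow with output length.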
- In practice
- Every major commercial LLM, including GPT-4, Claude, Gemini, and Llama, uses autoregressive generation. The asymmetry shows up in pricing: input tokens can be processed together in a single parallel prefill pass, while each output token requires its own sequential forward pass, which is why API providers price output tokens higher than input tokens. Speculative decoding, used by Google and others, partially circumvents the sequential bottleneck by drafting multiple tokens with a small model and verifying them in parallel with the large model, achieving 2-3x speedups.
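The draft-and-verify idea can be sketched as follows. This is a simplified illustration, not Google's implementation: `draft_model` and `verify_model` are hypothetical callables that each return a single next token given a context, and the "parallel" verification pass is simulated sequentially for clarity.

```python
def speculative_step(context, draft_model, verify_model, k=4):
    # 1. The small draft model cheaply proposes k candidate tokens in sequence.
    draft_ctx = list(context)
    proposed = []
    for _ in range(k):
        tok = draft_model(draft_ctx)
        proposed.append(tok)
        draft_ctx.append(tok)

    # 2. The large model checks all k positions (in a real system, one
    #    parallel forward pass). Accept the longest matching prefix; on the
    #    first mismatch, emit the large model's token instead and stop.
    accepted = []
    check_ctx = list(context)
    for tok in proposed:
        target = verify_model(check_ctx)
        if target != tok:
            accepted.append(target)  # correction preserves output quality
            break
        accepted.append(tok)
        check_ctx.append(tok)
    return accepted
```

When the draft model agrees with the verifier, one large-model pass yields up to k tokens instead of one; when it disagrees, the output is exactly what the large model alone would have produced, which is why the technique speeds up generation without changing quality.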
Related terms
Token
The basic unit of text that AI models process, roughly equivalent to 3/4 of a word in English. Tokens are how models read, price, and limit input and output, making token efficiency a key cost lever.
Transformer
The neural network architecture behind virtually all modern language and multi-modal models. Introduced in Google's 2017 'Attention Is All You Need' paper, transformers use self-attention to process sequences in parallel.
Inference
The process of running a trained model to generate predictions or outputs from new inputs. Inference cost per token is the key economic metric for AI deployment and is falling rapidly.
Speculative decoding
An inference optimization where a small, fast 'draft' model generates candidate tokens that a larger 'verifier' model checks in parallel, speeding up generation without changing output quality.