Models & Architecture Deep Dive

Autoregressive model

Definition
A model that generates output one token at a time, with each new token conditioned on all previous tokens. GPT, Claude, and Gemini are all autoregressive, which is why they stream responses word by word.
Why it matters
The autoregressive architecture is both the strength and the limitation of current LLMs. Its strength: each token is generated with full context of everything before it, producing coherent long-form output. Its limitation: generation is inherently sequential, so latency scales linearly with output length. This is why a 10,000-token response takes roughly 10x longer than a 1,000-token response, and why speculative decoding and parallel generation techniques are hot research areas. For builders, understanding autoregressive generation explains why streaming matters, why output length affects cost, and why certain tasks (like editing a paragraph in place) are awkward for current models.
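The sequential bottleneck is easiest to see as a loop. Below is a minimal sketch, using a toy deterministic function (`toy_next_token`, a hypothetical stand-in for a real model's forward pass) rather than an actual neural network. The structure is what matters: each step conditions on all tokens so far, and no step can start before the previous one finishes.

```python
def toy_next_token(context):
    # Hypothetical stand-in for a model's forward pass: derives the
    # next "token" deterministically from the full context so far.
    return sum(context) % 7 + 1

def generate(prompt, n_new_tokens):
    tokens = list(prompt)
    for _ in range(n_new_tokens):
        nxt = toy_next_token(tokens)  # one full forward pass per output token
        tokens.append(nxt)            # the new token joins the context
    return tokens

out = generate([3, 1, 4], 5)  # 5 new tokens means 5 sequential passes
```

Because each iteration depends on the previous one's output, the loop cannot be parallelized across output positions; that is the property speculative decoding works around.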
In practice
Every major commercial LLM, including GPT-4, Claude, Gemini, and Llama, uses autoregressive generation. The architecture means that each output token costs roughly the same amount of compute, which is why API providers price output tokens separately from input tokens: output tokens are more expensive because each requires its own forward pass, while input tokens are processed together in a single parallel prefill pass. Speculative decoding, used by Google and others, partially circumvents the sequential bottleneck by drafting multiple tokens with a small model and verifying them in parallel with the large model, achieving 2-3x speedups.
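The draft-then-verify idea can be sketched in a few lines. This is a simplified greedy variant with toy deterministic functions (`draft_model` and `target_model` are hypothetical stand-ins, not real models): the draft proposes k tokens cheaply and sequentially, the target checks every position (in a real system, one batched forward pass), and the longest agreeing prefix is accepted along with one token from the target itself.

```python
def draft_model(context):
    # Cheap, approximate next-token function (toy stand-in).
    return (sum(context) * 3) % 5

def target_model(context):
    # Slower, authoritative next-token function (toy stand-in).
    return (sum(context) * 3 + len(context) // 4) % 5

def speculative_step(tokens, k=4):
    # 1. Draft: propose k tokens sequentially with the cheap model.
    draft = list(tokens)
    for _ in range(k):
        draft.append(draft_model(draft))
    proposed = draft[len(tokens):]

    # 2. Verify: compare the target's prediction at each drafted
    #    position (parallelizable, since all contexts are known).
    accepted = []
    for i in range(k):
        t = target_model(tokens + proposed[:i])
        if t == proposed[i]:
            accepted.append(proposed[i])   # draft matched: keep it
        else:
            accepted.append(t)             # mismatch: target's token ends the run
            return tokens + accepted
    # All k drafts matched; the target's verification pass also
    # yields one extra token for free.
    accepted.append(target_model(tokens + accepted))
    return tokens + accepted
```

With greedy acceptance like this, the output is identical to what the target model would have generated alone; the win is that one verification pass can confirm several tokens at once.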
