Models & Architecture Deep Dive

Test-time compute

Definition
The practice of allocating additional compute during inference to improve output quality, rather than relying solely on the capabilities baked in during training. Reasoning models and extended thinking are the primary examples of test-time compute scaling.
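One of the simplest forms of test-time compute scaling is self-consistency: sample the model several times on the same problem and return the majority answer, so that more inference calls buy more reliability. The sketch below is illustrative only; `sample_answer` is a hypothetical stand-in for a stochastic model call, not a real API.

```python
import random
from collections import Counter

def sample_answer(rng):
    # Hypothetical stand-in for one stochastic model call: returns the
    # correct answer "42" 60% of the time, otherwise a wrong answer.
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43"])

def majority_vote(n, seed=0):
    """Spend n inference calls on one problem; return the most common answer."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(rng) for _ in range(n))
    return votes.most_common(1)[0][0]

# More samples (more test-time compute) make the majority answer more
# reliable than any single sample, without touching the model itself.
print(majority_vote(1))
print(majority_vote(25))
```

Under these toy assumptions, a single sample is right about 60% of the time, while a 25-sample majority vote is right far more often: the extra inference compute converts directly into accuracy.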
Why it matters
Test-time compute introduces a second scaling axis for AI capability. The first axis is training compute: bigger models trained on more data. The second is inference compute: more thinking time per problem. The practical consequence is that you can improve a model's outputs without retraining it, simply by spending more compute at inference time. For product developers, this creates a natural price-quality dial: simple questions get fast, cheap responses, while complex questions trigger extended reasoning at higher cost. For the industry, test-time compute scaling could sustain capability improvements even if training scaling laws begin to plateau.
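The price-quality dial can be sketched as a simple router that estimates query difficulty and picks an inference budget accordingly. Everything here is a hypothetical assumption: the model names, prices, token budgets, and the crude length/keyword heuristic standing in for a real difficulty classifier.

```python
def route(query: str) -> dict:
    """Pick a model and thinking budget for a query (illustrative sketch).

    A real system would use a learned difficulty classifier; here, query
    length and a few trigger words stand in for one.
    """
    hard = len(query) > 120 or any(
        w in query.lower() for w in ("prove", "debug", "optimize", "step by step")
    )
    if hard:
        # Hypothetical reasoning tier: large thinking budget, higher price.
        return {"model": "reasoning-large", "thinking_budget_tokens": 8192,
                "est_cost_usd": 0.50}
    # Hypothetical fast tier: no extended thinking, low price.
    return {"model": "standard-small", "thinking_budget_tokens": 0,
            "est_cost_usd": 0.01}

print(route("What is the capital of France?")["model"])
print(route("Prove that the sum of two even numbers is even, step by step.")["model"])
```

The design point is that the routing decision, not the model weights, sets the cost: the same product can serve a $0.01 answer and a $0.50 answer from the same endpoint.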
In practice
OpenAI's o1 model demonstrated that spending 10-100x more inference compute on chain-of-thought reasoning dramatically improved performance on math, science, and coding tasks: it ranked in the 89th percentile on Codeforces competitive programming contests (versus the 11th percentile for GPT-4o) and placed among the top 500 US students on the AIME math qualifier. Anthropic's Claude offers an extended thinking mode that allocates variable compute based on problem complexity, and DeepSeek-R1 brought test-time compute scaling to open-weight models. The economics are notable: a reasoning model might spend $0.50 solving a problem that a standard model attempts for $0.01, but if the reasoning model gets the right answer and the standard model does not, the higher cost is justified for high-value tasks.
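The cost comparison above becomes concrete if you measure expected cost per correct answer rather than cost per attempt. A minimal sketch, using the $0.50 and $0.01 prices from the text but with hypothetical success rates, and assuming answers can be verified and retried until one is correct:

```python
def cost_per_correct(cost_per_attempt: float, p_correct: float) -> float:
    # If each attempt succeeds independently with probability p_correct and
    # failures can be detected and retried, the number of attempts needed is
    # geometric with mean 1/p_correct, so expected spend is cost / p.
    return cost_per_attempt / p_correct

# Prices from the text; the success rates are hypothetical assumptions for
# a hard problem the standard model almost never solves.
reasoning = cost_per_correct(0.50, 0.90)  # $0.50/attempt, 90% success
standard = cost_per_correct(0.01, 0.01)   # $0.01/attempt, 1% success
print(f"reasoning: ${reasoning:.2f}/correct, standard: ${standard:.2f}/correct")
```

Under these assumptions the reasoning model is actually cheaper per correct answer (about $0.56 versus $1.00), despite a 50x higher sticker price per call; the comparison flips back toward the cheap model as the task gets easier.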

We cover models & architecture every week.

Get the 5 AI stories that matter — free, every Friday.
