Inference cost
- Definition
- The expense of running an AI model in production, typically measured per million tokens. Inference costs have dropped 10-100x in the past two years, enabling new business models and use cases.
- Why it matters
- Inference cost is the single most important economic variable in AI deployment. It determines your gross margin, which use cases are viable, and whether you can afford to run AI at scale. The cost curve matters more than the current price: if costs drop 10x per year, a use case that is uneconomical today will be trivially cheap in 18 months. This creates a strategic imperative to build the infrastructure and product surfaces now, before the economics fully arrive. Companies that wait for costs to drop before building will find that competitors who invested early have already locked in users and data flywheels.
- In practice
- GPT-4 launched at $60/M output tokens in March 2023. GPT-4o Mini launched at $0.60/M output tokens in July 2024, a 100x reduction in 16 months for comparable quality on many tasks. Anthropic's Claude pricing followed a similar trajectory. On the self-hosted side, running Llama 3 70B on a single NVIDIA H100 costs roughly $0.20/M tokens, competitive with managed API pricing. DeepSeek's R1 demonstrated frontier reasoning at a fraction of the cost. The decline is driven by hardware improvements, model efficiency gains, quantization, and competitive pressure. If the current trajectory holds, GPT-4-class inference could cost under $0.01/M tokens by 2027.
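The self-hosted figure above can be sanity-checked with simple arithmetic: divide the GPU's hourly rental rate by its token throughput, and extrapolate along the cost curve. A minimal sketch in Python, where the $2.50/hr rate, the 3,500 tokens/s throughput, and the 10x/year decline are illustrative assumptions, not measured benchmarks:

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Cost to generate one million tokens on a GPU billed by the hour."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

def projected_cost(cost_today: float, annual_drop_factor: float, years: float) -> float:
    """Extrapolate a cost that declines by a constant factor each year."""
    return cost_today / (annual_drop_factor ** years)

# Assumed: H100 rented at ~$2.50/hr, batched serving at ~3,500 tokens/s.
print(round(cost_per_million_tokens(2.50, 3500), 2))  # ≈ $0.20/M tokens

# At a 10x/year decline, a $10/M-token workload in 18 months:
print(round(projected_cost(10.0, 10, 1.5), 2))        # ≈ $0.32/M tokens
```

The throughput term dominates: batching more requests per GPU lowers the per-token cost directly, which is why serving efficiency matters as much as the rental rate.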
Related terms
Inference
The process of running a trained model to generate predictions or outputs from new inputs. Inference cost per token is the key economic metric for AI deployment and is falling rapidly.
Token pricing
The cost model used by AI API providers, charging per million input and output tokens. Prices have fallen dramatically, from $60/M tokens (GPT-4, 2023) to under $1/M tokens for many models in 2026.
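Per-request spend under this model is a weighted sum of input and output tokens at their respective rates. A small sketch, using GPT-4's 2023 list prices ($30/M input, $60/M output) as the example rates:

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one API call under per-million-token pricing."""
    return (input_tokens / 1e6) * input_price_per_m \
         + (output_tokens / 1e6) * output_price_per_m

# A 2,000-token prompt with a 500-token completion at GPT-4's 2023 rates:
print(round(request_cost_usd(2000, 500, 30.0, 60.0), 2))  # ≈ $0.09
```

Output tokens typically cost 2-4x more than input tokens, so long completions dominate the bill even when prompts are large.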
Inference economics
The study of costs, pricing models, and margin structures around running AI models in production, encompassing hardware costs, model efficiency, pricing strategies, and the competitive dynamics of the inference market.
Quantization
Reducing the numerical precision of a model's weights (e.g., from 32-bit to 4-bit) to shrink its memory footprint and speed up inference. Quantization makes it possible to run large models on consumer hardware.
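The memory savings are easy to estimate from first principles: weight memory is parameter count times bits per weight. A rough sketch for a 70B-parameter model, counting weights only (activations and KV cache add more on top):

```python
def model_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB; ignores activations and KV cache."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(model_memory_gb(70, 16))  # 140.0 GB at 16-bit: needs multiple GPUs
print(model_memory_gb(70, 4))   # 35.0 GB at 4-bit: fits a single large GPU
```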