Infrastructure & Compute

Inference cost

Definition
The expense of running an AI model in production, typically measured per million tokens. Inference costs have dropped 10-100x in the past two years, enabling new business models and use cases.
Why it matters
Inference cost is the single most important economic variable in AI deployment. It determines your gross margin, which use cases are viable, and whether you can afford to run AI at scale. The cost curve matters more than the current price: if costs drop 10x per year, a use case that is uneconomical today will be trivially cheap in 18 months. This creates a strategic imperative to build the infrastructure and product surfaces now, before the economics fully arrive. Companies that wait for costs to drop before building will find that competitors who invested early have already locked in users and data flywheels.
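To make "it determines your gross margin" concrete, here is a minimal sketch of how per-million-token pricing flows into unit economics. All the numbers (token counts, prices, subscription fee) are illustrative assumptions, not real pricing:

```python
# Sketch: how per-million-token pricing flows into unit economics.
# All figures below are illustrative assumptions, not real pricing.

def inference_cost_per_request(input_tokens, output_tokens,
                               input_price_per_m, output_price_per_m):
    """Dollar cost of one request at per-million-token prices."""
    return (input_tokens * input_price_per_m +
            output_tokens * output_price_per_m) / 1_000_000

# Assumed workload: a chat product sending 2,000 tokens of context
# and getting 500 tokens back, at $0.15/M input and $0.60/M output.
cost = inference_cost_per_request(2_000, 500, 0.15, 0.60)

# If the product charges $20/month and a user makes 1,000 requests,
# the gross margin on that user is:
monthly_inference_cost = 1_000 * cost
gross_margin = (20 - monthly_inference_cost) / 20
print(f"${cost:.4f}/request, {gross_margin:.0%} gross margin")
```

At these assumed prices the model spend is $0.60/month against $20 of revenue, a 97% gross margin; at GPT-4-launch prices the same workload would have been roughly 100x more expensive, which is why the cost curve, not the current price, drives which products get built.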
In practice
GPT-4 launched at $60/M output tokens in March 2023. GPT-4o Mini launched at $0.60/M output tokens in July 2024, a 100x reduction in 16 months for comparable quality on many tasks. Anthropic's Claude pricing followed a similar trajectory. On the self-hosted side, running Llama 3 70B on a single NVIDIA H100 costs roughly $0.20/M tokens, competitive with managed API pricing. DeepSeek's R1 demonstrated frontier reasoning at a fraction of the cost. The inference cost decline is driven by hardware improvements, model efficiency gains, quantization, and competitive pressure. At the current trajectory, GPT-4-class inference will cost under $0.01/M tokens by 2027.
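Two back-of-envelope calculations sit behind those figures. The first derives a self-hosted $/M-token cost from a GPU rental rate and sustained throughput; the rate (~$2.50/hr) and throughput (~3,500 tokens/second with batching) are assumed figures, not measured benchmarks. The second computes the annualized price decline implied by the GPT-4 to GPT-4o Mini drop:

```python
# Back-of-envelope math for self-hosted cost and the price decline rate.
# GPU rate and throughput are assumed figures, not measured benchmarks.

def cost_per_million_tokens(gpu_hourly_rate, tokens_per_second):
    """Self-hosted $/M tokens at a given sustained throughput."""
    return gpu_hourly_rate / (tokens_per_second * 3600) * 1_000_000

# Assuming ~$2.50/hr for an H100 and ~3,500 tok/s across batched
# requests, this lands near the $0.20/M figure cited above.
self_hosted = cost_per_million_tokens(2.50, 3_500)

def annual_decline_factor(start_price, end_price, months):
    """Per-year price decline implied by an observed drop."""
    return (start_price / end_price) ** (12 / months)

# $60/M (GPT-4, Mar 2023) to $0.60/M (GPT-4o Mini, Jul 2024):
# a 100x drop over 16 months annualizes to roughly a 31x/year decline.
factor = annual_decline_factor(60, 0.60, 16)
print(f"${self_hosted:.2f}/M self-hosted, ~{factor:.0f}x/year decline")
```

Note how sensitive the self-hosted number is to throughput: it assumes the GPU is saturated with batched traffic, so a lightly loaded deployment can easily cost 10x more per token than the managed APIs it is meant to undercut.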

We cover infrastructure & compute every week.

Get the 5 AI stories that matter — free, every Friday.
