Inference
- Definition
- The process of running a trained model to generate predictions or outputs from new inputs. Inference cost per token is the key economic metric for AI deployment and is falling rapidly.
- Why it matters
- Training gets the headlines, but inference is where the money is made and spent. Every API call, every chatbot response, every agent action is an inference operation. As AI moves from demos to production, inference costs dominate total cost of ownership, often exceeding training costs within months of deployment. The inference cost curve is the most important trend in AI economics: as costs fall roughly 10x per year, previously unviable applications become possible. Companies that optimize inference (through model selection, quantization, caching, and batching) gain a direct margin advantage. Understanding inference economics is not optional for any AI product leader.
- In practice
- OpenAI's GPT-4 Turbo launched at roughly 1/3 the cost of GPT-4, then GPT-4o cut costs further. By 2025, inference costs for GPT-4-class quality had fallen over 100x compared to GPT-4's launch price. Groq's LPU chips and Cerebras's wafer-scale engine compete on inference speed, delivering hundreds of tokens per second. On the self-hosted side, vLLM and TGI optimize GPU utilization for inference workloads. The cost decline has enabled new product categories: real-time voice agents (which require sub-200ms latency), AI-powered code editors (which run dozens of model calls per keystroke), and autonomous research agents (which use thousands of tokens per task).
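The unit economics described above reduce to simple arithmetic. The sketch below is illustrative only: the per-million-token prices, token counts, and the 10x annual decline rate are assumptions for the example, not quoted figures from any provider.

```python
# Illustrative sketch of inference unit economics.
# All prices, token counts, and the decline rate are assumed, not real quotes.

def cost_per_request(input_tokens, output_tokens,
                     input_price_per_m, output_price_per_m):
    """Dollar cost of one request, given per-million-token prices."""
    return (input_tokens / 1e6) * input_price_per_m \
         + (output_tokens / 1e6) * output_price_per_m

def projected_price(price_today, years, annual_decline=10.0):
    """Price after `years` if costs fall `annual_decline`x per year."""
    return price_today / (annual_decline ** years)

# A hypothetical chatbot turn: 1,000 input tokens, 500 output tokens,
# at assumed prices of $2.50 (input) and $10.00 (output) per million tokens.
today = cost_per_request(1_000, 500, 2.50, 10.00)
in_two_years = cost_per_request(1_000, 500,
                                projected_price(2.50, 2),
                                projected_price(10.00, 2))
print(f"${today:.4f} per request today")        # $0.0075 per request today
print(f"${in_two_years:.7f} in two years")      # 100x cheaper under the assumption
```

Under these assumed numbers, a product serving a million such requests a day spends $7,500/day today; the same 100x decline that the sketch projects is what turns token-hungry categories like autonomous research agents from unviable to routine.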
Related terms
Inference cost
The expense of running an AI model in production, typically measured per million tokens. Inference costs have dropped 10-100x in the past two years, enabling new business models and use cases.
Inference economics
The study of costs, pricing models, and margin structures around running AI models in production, encompassing hardware costs, model efficiency, pricing strategies, and the competitive dynamics of the inference market.
Latency
The time between sending a request to an AI model and receiving the first token of the response. Low latency is critical for real-time applications like coding assistants, voice agents, and live customer support.
Throughput
The number of tokens or requests an AI system can process per second. High throughput is essential for batch processing, high-traffic applications, and cost-efficient inference at scale.
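Both metrics above can be read off the same stream of token timestamps. The helpers below are a hypothetical sketch, assuming you have recorded the request send time and the arrival time of each streamed token:

```python
# Hypothetical sketch: deriving latency (time to first token) and
# throughput (tokens per second) from recorded arrival timestamps.

def time_to_first_token(request_sent_at, token_times):
    """Latency as defined above: delay until the first token arrives."""
    return token_times[0] - request_sent_at

def tokens_per_second(token_times):
    """Throughput over the generation window (needs >= 2 tokens)."""
    elapsed = token_times[-1] - token_times[0]
    return (len(token_times) - 1) / elapsed

# Assumed timestamps in seconds: request sent at t=0.0, 5 tokens stream in.
times = [0.18, 0.20, 0.22, 0.24, 0.26]
print(time_to_first_token(0.0, times))        # 0.18 s: under a ~200 ms budget
print(round(tokens_per_second(times), 1))     # ~50 tok/s (4 tokens over ~0.08 s)
```

Note the two metrics pull in different directions: batching more requests together raises aggregate throughput but can push time-to-first-token past what a voice agent or coding assistant will tolerate.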