Infrastructure & Compute

Inference

Definition
The process of running a trained model to generate predictions or outputs from new inputs. Inference cost per token is the key economic metric for AI deployment and is falling rapidly.
Why it matters
Training gets the headlines, but inference is where the money is made and spent. Every API call, every chatbot response, every agent action is an inference operation. As AI moves from demos to production, inference costs dominate total cost of ownership, often exceeding training costs within months of deployment. The inference cost curve is the most important trend in AI economics: as costs fall roughly 10x per year, previously unviable applications become possible. Companies that optimize inference (through model selection, quantization, caching, and batching) gain a direct margin advantage. Understanding inference economics is not optional for any AI product leader.
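The per-token economics above can be made concrete with some simple arithmetic. A minimal sketch, using illustrative prices (not any provider's actual rates) to show how per-million-token pricing translates into per-request and fleet-level cost:

```python
# Sketch: per-request inference cost from per-token prices.
# All prices and token counts below are illustrative assumptions.

def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_mtok: float,
                 output_price_per_mtok: float) -> float:
    """Dollar cost of one inference call, given prices per million tokens."""
    return (input_tokens * input_price_per_mtok +
            output_tokens * output_price_per_mtok) / 1_000_000

# A typical chatbot turn: 1,500 prompt tokens in, 400 completion tokens out,
# at assumed prices of $2.50 (input) and $10.00 (output) per million tokens.
cost = request_cost(1_500, 400, 2.50, 10.00)
print(f"${cost:.5f} per request")            # → $0.00775 per request

# Fractions of a cent per call, but volume compounds it quickly.
daily = cost * 1_000_000
print(f"${daily:,.2f} per day at 1M requests/day")  # → $7,750.00 per day
```

Note that output tokens are priced several times higher than input tokens in most API pricing schemes, which is why long completions (agents, reasoning traces) dominate the bill even when prompts are large.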
In practice
OpenAI's GPT-4 Turbo launched at roughly 1/3 the cost of GPT-4, then GPT-4o cut costs further. By 2025, inference costs for GPT-4-class quality had fallen over 100x compared to GPT-4's launch price. Groq's LPU chips and Cerebras's wafer-scale engine compete on inference speed, delivering hundreds of tokens per second. On the self-hosted side, vLLM and TGI optimize GPU utilization for inference workloads. The cost decline has enabled new product categories: real-time voice agents (which require sub-200ms latency), AI-powered code editors (which run dozens of model calls per keystroke), and autonomous research agents (which use thousands of tokens per task).
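The link between decode speed and product viability is simple arithmetic. A sketch, with assumed (not measured) throughput figures, showing why the hundreds-of-tokens-per-second hardware mentioned above matters for a sub-200ms voice budget:

```python
# Sketch: checking a latency budget against decode throughput.
# Token counts and tok/s rates are illustrative assumptions.

def decode_time_ms(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock milliseconds to generate `tokens` at a steady decode rate."""
    return tokens / tokens_per_second * 1000

# A voice agent needs ~20 tokens for its first spoken phrase before
# audio synthesis can start. Compare a modest decode rate with a
# speed-optimized one against a 200 ms responsiveness budget.
for rate in (50, 500):
    t = decode_time_ms(20, rate)
    verdict = "fits" if t <= 200 else "misses"
    print(f"{rate} tok/s → {t:.0f} ms for 20 tokens ({verdict} 200 ms budget)")
```

At 50 tokens/second the first phrase alone takes 400 ms, before accounting for network and speech synthesis; at 500 tokens/second it takes 40 ms, leaving headroom for the rest of the pipeline. The same arithmetic explains why agents that burn thousands of tokens per task only became practical once both cost and speed improved.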
