Rate limiting
- Definition
- Controls that cap the number of API requests a user or application can make in a given time period. Rate limits are how AI providers manage capacity, prevent abuse, and enforce pricing tiers.
- Why it matters
- Rate limits are the hidden constraint that determines what you can actually build. A model's benchmark score is irrelevant if the rate limit prevents you from using it at scale. Rate limits vary dramatically by provider, tier, and model: free tiers may allow 10 requests/minute, while enterprise tiers allow thousands. For architects, rate limits drive design decisions: caching, request queuing, fallback providers, and batch processing all exist partly because of rate limits. Understanding rate limits before committing to a provider prevents painful surprises when you try to scale from proof-of-concept to production.
- In practice
- OpenAI's rate limits range from 500 RPM (requests per minute) on Tier 1 to 10,000+ RPM on Tier 5, based on spending history. Anthropic uses a token-based rate limiting system with per-model and per-tier limits. Google's Gemini API has rate limits that vary by model and region. In production, companies implement multi-provider routing: if one provider hits its rate limit, requests automatically route to a backup provider. Vercel's AI Gateway handles this automatically. Caching is also essential: identical or similar requests should return cached results rather than consuming rate limit quota. Companies running AI features for millions of users typically need enterprise-level rate limits and multiple provider relationships.
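The multi-provider fallback and caching described above can be sketched in a few lines. Everything here is illustrative: `RateLimitExceeded` is a stand-in for a provider SDK's 429 error, the provider callables are hypothetical, and a real gateway would add TTLs, per-provider health tracking, and semantic (not just exact-match) caching.

```python
class RateLimitExceeded(Exception):
    """Stand-in for a provider SDK's HTTP 429 error."""

# Exact-match cache: identical prompts return stored results
# instead of consuming rate-limit quota.
_cache: dict = {}

def route_request(prompt, providers):
    """Try each (name, call_fn) provider in order; if one raises
    RateLimitExceeded, fall through to the next."""
    for name, call_fn in providers:
        try:
            return name, call_fn(prompt)
        except RateLimitExceeded:
            continue
    raise RuntimeError("all providers rate-limited")

def complete(prompt, providers):
    """Serve from cache when possible, otherwise route to the
    first provider with remaining capacity."""
    if prompt in _cache:
        return _cache[prompt]
    _name, result = route_request(prompt, providers)
    _cache[prompt] = result
    return result
```

For example, with a primary provider that is currently rate-limited and a backup that is not, the first call falls through to the backup and the second identical call is served from cache without touching either provider.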
Related terms
API (Application Programming Interface)
The programmatic interface that lets developers send prompts to an AI model and receive responses. Model vendors like OpenAI, Anthropic, and Google monetize primarily through API access, priced per token.
Endpoint
A specific URL where an AI model is hosted and accepts API requests. Managing endpoints involves load balancing, rate limiting, and scaling to handle production traffic.
Throughput
The number of tokens or requests an AI system can process per second. High throughput is essential for batch processing, high-traffic applications, and cost-efficient inference at scale.
SLA (Service Level Agreement)
A contract defining uptime, latency, and throughput guarantees for an AI service. Enterprise buyers evaluate AI vendors heavily on SLAs, especially for mission-critical applications.