SLA (Service Level Agreement)
- Definition
- A contract defining uptime, latency, and throughput guarantees for an AI service. Enterprise buyers evaluate AI vendors heavily on SLAs, especially for mission-critical applications.
- Why it matters
- SLAs separate toy products from enterprise infrastructure. When an AI feature is in the critical path of your business (customer support, fraud detection, content moderation), you need contractual guarantees about availability, latency, and error rates. Most AI API providers offer 99.9% uptime SLAs, but the devil is in the details: what counts as downtime, what are the remedies for violations, and what latency percentiles are guaranteed? For engineering leaders, understanding SLA terms prevents nasty surprises when a provider has an outage during your peak traffic. For procurement teams, SLA comparison is one of the most effective ways to evaluate AI vendors, because it reveals how confident they are in their infrastructure.
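Uptime percentages translate directly into a downtime budget, which is often the fastest way to compare SLA tiers. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def downtime_budget_minutes(sla_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted per period under an uptime SLA."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_pct / 100)

print(downtime_budget_minutes(99.9))   # ~43.2 minutes per 30-day month
print(downtime_budget_minutes(99.99))  # ~4.3 minutes per 30-day month
```

The jump from 99.9% to 99.99% cuts the allowed downtime by a factor of ten, which is why "how many nines" is usually the first question in an SLA comparison.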
- In practice
- OpenAI's Enterprise tier offers a 99.9% uptime SLA. Anthropic provides similar guarantees for enterprise customers. AWS Bedrock and Azure AI include SLAs as part of their cloud service agreements. In practice, sophisticated enterprises do not rely on a single provider's SLA: they implement multi-provider architectures with automatic failover. When OpenAI experienced major outages in 2024, companies with Anthropic or Google as backup providers maintained service continuity. The lesson: an SLA is a financial remedy, not a guarantee of uptime. Design your architecture to survive SLA violations, because they will happen.
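The failover pattern above can be sketched in a few lines. This is a simplified illustration, not any provider's SDK: the provider callables and the `AllProvidersFailed` exception are hypothetical stand-ins for real client calls.

```python
from typing import Callable

class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""

def complete_with_failover(prompt: str,
                           providers: list[Callable[[str], str]]) -> str:
    """Try each provider in priority order; fall back on any error."""
    errors: list[Exception] = []
    for call in providers:
        try:
            return call(prompt)
        except Exception as exc:
            errors.append(exc)  # record the failure, try the next provider
    raise AllProvidersFailed(errors)
```

A production version would add timeouts, retries with backoff, and circuit breakers per provider, but the core idea is the same: the SLA pays you back after an outage, while the failover chain keeps you serving during one.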
Related terms
Rate limiting
Controls that cap the number of API requests a user or application can make in a given time period. Rate limits are how AI providers manage capacity, prevent abuse, and enforce pricing tiers.
Latency
The time between sending a request to an AI model and receiving the first token of the response. Low latency is critical for real-time applications like coding assistants, voice agents, and live customer support.
Endpoint
A specific URL where an AI model is hosted and accepts API requests. Managing endpoints involves load balancing, rate limiting, and scaling to handle production traffic.
Throughput
The number of tokens or requests an AI system can process per second. High throughput is essential for batch processing, high-traffic applications, and cost-efficient inference at scale.
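Rate limiting, mentioned above, is commonly implemented with a token bucket: requests spend tokens, and tokens refill at a fixed rate up to a capacity cap. A minimal sketch, assuming a single-threaded caller (class and method names are illustrative):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: up to `capacity` tokens, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Return True and spend one token if the request is within the limit."""
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last check, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The `capacity` controls burst tolerance and `rate` controls sustained throughput, which is how providers map pricing tiers onto limits: a higher tier simply gets a bigger bucket and a faster refill.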