Throughput
- Definition
- The number of tokens or requests an AI system can process per second. High throughput is essential for batch processing, high-traffic applications, and cost-efficient inference at scale.
- Why it matters
- Throughput determines how much work your AI infrastructure can handle and, by extension, how many users you can serve cost-effectively. Latency tells you how fast a single request completes; throughput tells you how many requests you can handle per unit of time. For applications serving millions of users, throughput is often the binding constraint: you might have acceptable per-request latency but insufficient throughput to handle peak traffic. Optimizing throughput involves batching requests, using continuous batching, maximizing GPU utilization, and choosing the right hardware. For self-hosted deployments, throughput directly determines your cost per request and therefore your gross margin.
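The link between throughput and cost per request can be sketched with back-of-envelope math. The figures below (GPU hourly price, sustained tokens per second, tokens per request) are illustrative assumptions, not vendor quotes:

```python
# Back-of-envelope throughput economics. All input numbers are
# hypothetical assumptions chosen for illustration.

def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per 1M output tokens for one GPU at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000

def requests_per_second(tokens_per_second: float, avg_tokens_per_request: float) -> float:
    """How many requests per second a given token throughput supports."""
    return tokens_per_second / avg_tokens_per_request

# Example: a $2.50/hr GPU sustaining 2,500 output tokens/sec, 500-token responses.
print(f"${cost_per_million_tokens(2.50, 2500):.2f} per 1M output tokens")  # $0.28
print(f"{requests_per_second(2500, 500):.1f} requests/sec")                # 5.0
```

Doubling sustained throughput on the same hardware halves the cost per token, which is why serving-stack optimization shows up directly in gross margin.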
- In practice
- vLLM's continuous batching technique increased throughput by 2-4x compared to static batching by processing new requests as soon as GPU cycles become available. Groq's LPU hardware achieves over 500 tokens per second on Llama models, roughly 10x the throughput of GPU-based serving. NVIDIA's TensorRT-LLM optimizes transformer inference for maximum throughput on NVIDIA hardware. In production, companies monitor throughput in tokens-per-second-per-GPU to evaluate infrastructure efficiency. A well-optimized serving stack on H100s can achieve 2,000-5,000 output tokens per second for a 70B model, compared to 200-500 for naive implementations.
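The intuition behind continuous batching's gain can be shown with a toy decode-step simulation (a hypothetical model, not vLLM code): static batching holds every slot until the longest request in the batch finishes, while continuous batching refills freed slots from the queue every step.

```python
# Toy comparison of static vs. continuous batching. "lengths" are the
# number of decode steps each request needs; capacity is concurrent slots.
# This is an illustrative sketch, not an implementation of vLLM.

def static_batching_steps(lengths, capacity=4):
    """Each batch runs until its longest request finishes; freed slots idle."""
    steps = 0
    for i in range(0, len(lengths), capacity):
        steps += max(lengths[i:i + capacity])
    return steps

def continuous_batching_steps(lengths, capacity=4):
    """Freed slots are refilled from the queue on every decode step."""
    queue = list(lengths)
    active = []
    steps = 0
    while queue or active:
        while queue and len(active) < capacity:
            active.append(queue.pop(0))
        steps += 1
        active = [n - 1 for n in active if n > 1]
    return steps

lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths))      # 200: each batch gated by a 100-step request
print(continuous_batching_steps(lengths))  # 110: short requests no longer block slots
```

With this mix of long and short requests, refilling slots finishes the same work in roughly half the time, which is the mechanism behind the 2-4x throughput gains cited above.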
Related terms
Latency
The time between sending a request to an AI model and receiving the first token of the response. Low latency is critical for real-time applications like coding assistants, voice agents, and live customer support.
Inference
The process of running a trained model to generate predictions or outputs from new inputs. Inference cost per token is the key economic metric for AI deployment and is falling rapidly.
Batch processing
Running multiple AI inference requests together to maximize throughput and reduce per-request cost. Batch processing is how companies handle large-scale data labeling, content generation, and analytics workloads efficiently.
GPU (Graphics Processing Unit)
The hardware chip that powers AI training and inference. NVIDIA's H100 and B200 GPUs are the most sought-after compute in the industry, with wait times and pricing driving major strategic decisions.