Infrastructure & Compute Deep Dive

Throughput

Definition
The number of tokens or requests an AI system can process per second. High throughput is essential for batch processing, high-traffic applications, and cost-efficient inference at scale.
Why it matters
Throughput determines how much work your AI infrastructure can handle and, by extension, how many users you can serve cost-effectively. Latency tells you how fast a single request completes; throughput tells you how many requests you can handle per unit of time. For applications serving millions of users, throughput is often the binding constraint: per-request latency may be acceptable while total capacity still falls short of peak traffic. Optimizing throughput involves batching requests, using continuous batching, maximizing GPU utilization, and choosing the right hardware. For self-hosted deployments, throughput directly determines your cost per request and therefore your gross margin.
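As a rough capacity-planning sketch (illustrative numbers, not figures from this article), the GPU count needed to sustain peak traffic falls out of demand in tokens per second divided by sustained per-GPU throughput:

```python
import math

def required_gpus(peak_qps: float, tokens_per_request: int,
                  gpu_throughput_tok_s: float) -> int:
    """GPUs needed so aggregate throughput covers peak token demand."""
    demand_tok_s = peak_qps * tokens_per_request      # total tokens/s at peak
    return math.ceil(demand_tok_s / gpu_throughput_tok_s)

# Hypothetical: 50 req/s at 400 output tokens each,
# with each GPU sustaining 2,000 tok/s.
print(required_gpus(50, 400, 2_000))  # → 10
```

This is why per-request latency alone is misleading: a single GPU could serve any one of those requests quickly, but meeting peak demand is a throughput question.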
In practice
vLLM's continuous batching technique increases throughput 2-4x over static batching by admitting new requests as soon as GPU capacity frees up, rather than waiting for a full batch to finish. Groq's LPU hardware achieves over 500 tokens per second on Llama models, roughly 10x the throughput of GPU-based serving. NVIDIA's TensorRT-LLM optimizes transformer inference for maximum throughput on NVIDIA hardware. In production, companies monitor throughput in tokens per second per GPU to evaluate infrastructure efficiency: a well-optimized serving stack on H100s can achieve 2,000-5,000 output tokens per second for a 70B model, versus 200-500 for naive implementations.
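To make the cost link concrete, here is a minimal sketch (hypothetical GPU rental price and throughput, chosen for illustration) converting tokens-per-second-per-GPU and an hourly GPU rate into serving cost per million output tokens:

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            tok_per_s_per_gpu: float) -> float:
    """Dollars per 1M output tokens at a given sustained throughput."""
    tokens_per_hour = tok_per_s_per_gpu * 3_600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# Hypothetical: an H100 rented at $4/hr sustaining 2,000 tok/s
print(round(cost_per_million_tokens(4.0, 2_000), 2))  # → 0.56

# The same GPU at a naive 300 tok/s costs ~7x more per token
print(round(cost_per_million_tokens(4.0, 300), 2))  # → 3.7
```

The arithmetic shows why tokens-per-second-per-GPU is the metric teams watch: at a fixed hardware price, cost per token is simply its inverse.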
