Infrastructure & Compute

Latency

Definition
The time between sending a request to an AI model and receiving the first token of the response. Low latency is critical for real-time applications like coding assistants, voice agents, and live customer support.
Why it matters
Latency determines user experience. Usability research consistently suggests that response times above 2 seconds cause significant user drop-off, and above 5 seconds most users abandon the interaction. For real-time applications like voice agents, even 500ms of latency feels unnatural. Latency optimization involves every layer of the stack: model architecture, hardware selection, geographic placement, inference engine, and network. The trade-off between latency and cost is a core architectural decision: smaller models are faster but less capable; larger models are more capable but slower. Streaming (returning tokens as they are generated) reduces perceived latency but not the total time-to-completion.
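The distinction between perceived and total latency can be made concrete by instrumenting a streaming response. The sketch below uses a hypothetical stand-in for a streaming model API (the function names and delay values are illustrative, not any vendor's SDK) and measures time-to-first-token separately from time-to-completion:

```python
import time

def fake_stream(tokens, first_token_delay=0.3, per_token_delay=0.05):
    """Hypothetical stand-in for a streaming model API: yields tokens
    after an initial time-to-first-token delay, then at a steady rate.
    The delay values are illustrative, not measured from any real model."""
    time.sleep(first_token_delay)
    for tok in tokens:
        yield tok
        time.sleep(per_token_delay)

def measure_latency(stream):
    """Consume a token stream; return (time_to_first_token, total_time) in seconds."""
    start = time.monotonic()
    ttft = None
    for _ in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
    total = time.monotonic() - start         # last token arrived
    return ttft, total

ttft, total = measure_latency(fake_stream(["Hello", " world", "!"]))
print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms")
```

With streaming, the user starts reading at `ttft`; without it, nothing appears until `total`. That gap is exactly the "perceived latency" improvement the paragraph describes, and it grows with response length.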
In practice
Groq's LPU achieves time-to-first-token of under 100ms on Llama models, roughly 10x faster than GPU-based inference. Edge deployment brings models closer to users, cutting network round-trip latency from 100-200ms to under 10ms. Voice AI companies like ElevenLabs and Hume target end-to-end latency under 500ms for natural conversation. Anthropic and OpenAI offer streaming responses that return tokens as generated, so the first tokens appear in under 200ms even for complex queries. In enterprise benchmarks, the difference between 200ms and 2000ms latency correlates with 3-5x differences in user satisfaction and feature adoption.
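Benchmarks like the ones cited above are usually reported as percentiles rather than averages, because a handful of slow requests (tail latency) dominates user perception even when the mean looks fine. A minimal nearest-rank percentile sketch, with made-up sample latencies for illustration:

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the samples are at or below it."""
    ordered = sorted(samples)
    k = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[k]

# Illustrative request latencies in milliseconds (not real benchmark data):
# mostly ~200ms, with two ~2s outliers in the tail.
latencies_ms = [180, 210, 195, 250, 1900, 205, 220, 190, 2100, 200]

print("p50:", percentile(latencies_ms, 50), "ms")
print("p95:", percentile(latencies_ms, 95), "ms")
```

Here the median sits near 200ms while the 95th percentile is over 2 seconds: exactly the 200ms-versus-2000ms spread the benchmark comparison describes, hidden inside a single service.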
