Quantization
- Definition
- Reducing the numerical precision of a model's weights (e.g., from 32-bit to 4-bit) to shrink its memory footprint and speed up inference. Quantization makes it possible to run large models on consumer hardware.
- Why it matters
- Quantization is the most practical optimization for anyone running models locally or managing inference costs. A 70B-parameter model in FP16 requires approximately 140GB of GPU memory; quantized to 4-bit, it fits in approximately 35GB. That is the difference between a $200K multi-GPU server and a single workstation GPU or a pair of consumer cards. The quality trade-off is often minimal: 4-bit quantization typically costs less than 2% on standard benchmarks, and 8-bit quantization is often indistinguishable from full precision. For companies self-hosting models, quantization can cut infrastructure costs by 50-75% while maintaining acceptable quality. It is the first optimization to apply.
- In practice
- GPTQ and AWQ are the most popular post-training quantization methods for LLMs, supported by every major inference framework. llama.cpp pioneered CPU-based quantized inference, enabling Llama models to run on MacBooks, and its GGUF file format became the standard for distributing quantized models. QLoRA combined 4-bit quantization with LoRA to enable fine-tuning of 65B-parameter models on a single 48GB GPU, a breakthrough for accessibility. In practice, most self-hosted production deployments use 4-bit or 8-bit quantization. The sweet spot depends on the task: creative writing and complex reasoning benefit from higher precision, while classification and extraction tasks tolerate aggressive quantization well.
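The core idea behind the definition above, mapping floats to small integers with a shared scale, can be sketched in plain Python. This is a toy per-tensor "absmax" scheme for illustration; real quantizers work per-channel or per-group and pack the integers tightly.

```python
def quantize_int8(weights):
    """Quantize a list of floats to int8 values with one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127  # map the largest weight to +/-127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from the quantized integers."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.03, 0.89]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
```

Each reconstructed weight is within half a scale step of the original, which is why the model's behavior changes so little at 8-bit.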
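The memory figures above follow from simple arithmetic: bytes ≈ parameters × bits per weight / 8. A quick sanity check, ignoring activation memory, KV cache, and per-group scale overhead:

```python
def model_memory_gb(params_billions: float, bits: int) -> float:
    """Approximate weight memory in gigabytes: parameters * bits / 8."""
    return params_billions * 1e9 * bits / 8 / 1e9

print(model_memory_gb(70, 16))  # 140.0 -> a 70B model in FP16
print(model_memory_gb(70, 4))   # 35.0  -> the same model at 4-bit
```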
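Methods like GPTQ, AWQ, and the GGUF block formats quantize weights in small groups, each with its own scale, so a single outlier only distorts its neighbors. A simplified sketch of the grouping idea, not any format's actual bit layout, and omitting zero-points and calibration:

```python
def quantize_4bit_grouped(weights, group_size=32):
    """Quantize floats in groups; each group stores ints in -7..7 plus a scale."""
    groups = []
    for i in range(0, len(weights), group_size):
        block = weights[i:i + group_size]
        scale = max(abs(w) for w in block) / 7 or 1.0  # avoid zero scale for all-zero groups
        q = [round(w / scale) for w in block]
        groups.append((q, scale))
    return groups

def dequantize_grouped(groups):
    """Reconstruct approximate floats from (ints, scale) groups."""
    return [v * scale for q, scale in groups for v in q]
```

With a large weight like 3.0 in one group, the other group's small weights keep a fine-grained scale, which is exactly what a single per-tensor scale would sacrifice.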
Related terms
Inference cost
The expense of running an AI model in production, typically measured per million tokens. Inference costs have dropped 10-100x in the past two years, enabling new business models and use cases.
Parameter
A learnable value inside a neural network that gets adjusted during training. Model size is measured in parameters (e.g., 70B, 405B), which roughly correlates with capability and cost.
Weights
The numerical values inside a neural network that encode everything the model has learned during training. Model weights are the core intellectual property of AI companies and the subject of intense open-vs.-closed debates.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning technique that injects small trainable matrices into a frozen model. LoRA lets companies customize large models at a fraction of the cost of full fine-tuning.