Infrastructure & Compute Deep Dive

Quantization

Definition
Reducing the numerical precision of a model's weights (e.g., from 32-bit to 4-bit) to shrink its memory footprint and speed up inference. Quantization makes it possible to run large models on consumer hardware.
Why it matters
Quantization is the most practical optimization for anyone running models locally or managing inference costs. A 70B-parameter model in FP16 requires approximately 140GB of GPU memory; quantized to 4-bit, it fits in approximately 35GB. This is the difference between needing a $200K multi-GPU server and running on a single consumer GPU. The quality trade-off is often minimal: 4-bit quantization typically causes less than 2% degradation on benchmarks, and 8-bit quantization is often indistinguishable from full precision. For companies self-hosting models, quantization can cut infrastructure costs by 50-75% while maintaining acceptable quality. It is the first optimization to apply.
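The arithmetic behind those figures is simple: weight storage is just parameter count times bits per weight. A quick sketch (the function name is ours, and it counts weights only, ignoring activations, KV cache, and framework overhead):

```python
def model_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight-storage footprint in decimal gigabytes.

    Weights only; activations, KV cache, and runtime overhead add more.
    """
    return n_params * bits_per_weight / 8 / 1e9

# A 70B-parameter model at different precisions:
for bits in (32, 16, 8, 4):
    print(f"{bits:2d}-bit: {model_memory_gb(70e9, bits):6.0f} GB")
```

Running this reproduces the numbers above: roughly 140 GB at FP16 and 35 GB at 4-bit for a 70B model.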
In practice
GPTQ and AWQ are the most popular quantization methods for LLMs, supported by every major inference framework. llama.cpp pioneered CPU-based quantized inference, enabling Llama models to run on MacBooks, and its GGUF format became the standard for distributing quantized models. QLoRA combined quantization with LoRA to enable fine-tuning of 65B models on a single 48GB GPU, a breakthrough for accessibility. In practice, most self-hosted production deployments use 4-bit or 8-bit quantization. The sweet spot depends on the task: creative writing and complex reasoning benefit from higher precision, while classification and extraction tasks are relatively insensitive to quantization.
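To see why lower precision loses quality, here is a toy round-trip: map floats to a small signed-integer range with a single scale, then map back. This is illustrative only; production methods such as GPTQ and AWQ use calibration data and finer-grained (per-group) scales, and the function names here are ours:

```python
def quantize_symmetric(weights, bits=8):
    """Symmetric round-to-nearest quantization with one per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 7 for int4
    scale = max(abs(w) for w in weights) / qmax
    scale = scale or 1.0  # guard against an all-zero tensor
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.31, -1.24, 0.07, 0.95, -0.48]
q8, s8 = quantize_symmetric(weights, bits=8)
q4, s4 = quantize_symmetric(weights, bits=4)

# Reconstruction error is bounded by half a quantization step,
# so it grows as the bit width shrinks:
err8 = max(abs(w - d) for w, d in zip(weights, dequantize(q8, s8)))
err4 = max(abs(w - d) for w, d in zip(weights, dequantize(q4, s4)))
```

The int8 round-trip stays within about 0.005 of the originals here, while int4 drifts by up to about 0.09, which is the intuition behind 8-bit being near-lossless and 4-bit being a measurable but usually acceptable trade-off.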
