Quantization
- Definition
- Reducing the numerical precision of a model's weights (e.g., from 32-bit to 4-bit) to shrink its memory footprint and speed up inference. Quantization makes it possible to run large models on consumer hardware.
- Why it matters
- Quantization is the most practical optimization for anyone running models locally or managing inference costs. A 70B-parameter model in FP16 requires approximately 140GB of GPU memory; quantized to 4-bit, it fits in approximately 35GB. That is the difference between a $200K multi-GPU server and a single workstation GPU or a pair of consumer cards. The quality trade-off is often minimal: 4-bit quantization typically costs less than 2% on standard benchmarks, and 8-bit quantization is often indistinguishable from full precision. For companies self-hosting models, quantization can cut infrastructure costs by 50-75% while maintaining acceptable quality. It is the first optimization to apply.
- In practice
- GPTQ and AWQ are the most popular post-training quantization methods for LLMs, supported by every major inference framework. llama.cpp pioneered CPU-based quantized inference, enabling Llama models to run on MacBooks, and its GGUF file format became the standard for distributing quantized models. QLoRA combined 4-bit quantization with LoRA to enable fine-tuning of 65B-parameter models on a single 48GB GPU, a breakthrough for accessibility. In practice, most self-hosted production deployments use 4-bit or 8-bit quantization. The sweet spot depends on the task: creative writing and complex reasoning benefit from higher precision, while classification and extraction tasks tolerate aggressive quantization well.
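The core idea behind the definition above, mapping floats to small integers with a shared scale, can be sketched in plain Python. This is a toy per-tensor "absmax" scheme for illustration; real quantizers work per-channel or per-group and pack the integers tightly.

```python
def quantize_int8(weights):
    """Quantize a list of floats to int8 values with one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127  # map the largest weight to +/-127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from the quantized integers."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.03, 0.89]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
```

Each reconstructed weight is within half a scale step of the original, which is why the model's behavior changes so little at 8-bit.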
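The memory figures above follow from simple arithmetic: bytes ≈ parameters × bits per weight / 8. A quick sanity check, ignoring activation memory, KV cache, and per-group scale overhead:

```python
def model_memory_gb(params_billions: float, bits: int) -> float:
    """Approximate weight memory in gigabytes: parameters * bits / 8."""
    return params_billions * 1e9 * bits / 8 / 1e9

print(model_memory_gb(70, 16))  # 140.0 -> a 70B model in FP16
print(model_memory_gb(70, 4))   # 35.0  -> the same model at 4-bit
```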
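Methods like GPTQ, AWQ, and the GGUF block formats quantize weights in small groups, each with its own scale, so a single outlier only distorts its neighbors. A simplified sketch of the grouping idea, not any format's actual bit layout, and omitting zero-points and calibration:

```python
def quantize_4bit_grouped(weights, group_size=32):
    """Quantize floats in groups; each group stores ints in -7..7 plus a scale."""
    groups = []
    for i in range(0, len(weights), group_size):
        block = weights[i:i + group_size]
        scale = max(abs(w) for w in block) / 7 or 1.0  # avoid zero scale for all-zero groups
        q = [round(w / scale) for w in block]
        groups.append((q, scale))
    return groups

def dequantize_grouped(groups):
    """Reconstruct approximate floats from (ints, scale) groups."""
    return [v * scale for q, scale in groups for v in q]
```

With a large weight like 3.0 in one group, the other group's small weights keep a fine-grained scale, which is exactly what a single per-tensor scale would sacrifice.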
Related terms
Inference cost
The expense of running an AI model in production, typically measured per million tokens. Inference costs have dropped 10-100x in the past two years, enabling new business models and use cases.
Parameter
A learnable value inside a neural network that gets adjusted during training. Model size is measured in parameters (e.g., 70B, 405B), which roughly correlates with capability and cost.
Weights
The numerical values inside a neural network that encode everything the model has learned during training. Model weights are the core intellectual property of AI companies and the subject of intense open-vs.-closed debates.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning technique that injects small trainable matrices into a frozen model. LoRA lets companies customize large models at a fraction of the cost of full fine-tuning.