What Is Model Quantization?
Quantization reduces the numerical precision of a model’s parameters — storing weights as 8-bit, 4-bit, or even lower-precision integers instead of 16- or 32-bit floats. This makes models smaller, faster, and runnable on consumer hardware.
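The core idea can be sketched in a few lines. This is a minimal illustration of symmetric 8-bit quantization — one scale factor mapping floats onto the int8 range — not the exact scheme any particular format uses (real formats quantize in small blocks, each with its own scale):

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric 8-bit quantization: map floats to int8 with a single scale."""
    scale = np.abs(weights).max() / 127.0  # largest weight maps to +/-127
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for use at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q.nbytes / w.nbytes)              # 0.25 — int8 is 4x smaller than float32
print(np.abs(w - w_hat).max() <= scale)  # True — rounding error stays within one step
```

The weights come back slightly wrong — that rounding error is the “quality impact” in the table below, and it grows as you shrink the bit width.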
Why Quantize?
A 7B-parameter model in full precision (FP16) requires ~14 GB of memory — 2 bytes per parameter. At lower precision:
| Precision | Memory | Quality Impact |
|---|---|---|
| FP16 (full) | 14 GB | Baseline |
| 8-bit (Q8) | 7 GB | Negligible |
| 4-bit (Q4) | 3.5 GB | Minimal |
| 2-bit (Q2) | 1.75 GB | Noticeable |
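The table follows directly from one formula: parameters × bits per weight, converted to bytes. A quick sketch of that arithmetic for a 7B model:

```python
def model_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory: parameters x bits per weight, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# Reproduce the table for a 7B-parameter model
for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    print(f"{label}: {model_memory_gb(7e9, bits):.2f} GB")
# FP16: 14.00 GB, Q8: 7.00 GB, Q4: 3.50 GB, Q2: 1.75 GB
```

This counts weights only — activations, the KV cache, and runtime overhead add to the real footprint.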
Common Formats
- GGUF: The standard format for Ollama and llama.cpp. Optimized for CPU inference, with Metal acceleration on Apple silicon Macs.
- GPTQ: GPU-focused quantization, popular on NVIDIA hardware.
- AWQ: Activation-aware quantization — uses activation statistics to protect the most important weights, preserving quality better than naive uniform approaches.
Quantization in Practice
When you download a model in Ollama, you’re typically getting a Q4 or Q5 quantized GGUF. This is why a “7B model” downloads as ~4 GB instead of 14 GB — it’s been quantized to fit comfortably in memory while maintaining most of its quality.
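The gap between the table’s 3.5 GB and the ~4 GB you actually download comes from overhead: GGUF Q4 variants store each block of weights with its own scale factor, so the effective cost is a bit more than 4 bits per weight. As a rough back-of-the-envelope check (the ~4.5 effective bits figure is an assumption in line with common Q4 builds, not an exact spec):

```python
params = 7e9
# Effective bits per weight for a typical Q4 GGUF build — an assumed
# round number; per-block scales push it above a flat 4 bits.
effective_bits = 4.5
size_gb = params * effective_bits / 8 / 1e9
print(f"{size_gb:.1f} GB")  # 3.9 GB — close to the ~4 GB download
```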
Elvean brings all these concepts together in one native Mac app — local models, cloud APIs, agentic tools, and more.
Learn more about Elvean