What Is Model Quantization?
Quantization reduces the numerical precision of a model’s parameters — storing weights as 8-bit, 4-bit, or even lower-precision integers instead of 16- or 32-bit floats. This makes models smaller, faster, and runnable on consumer hardware.
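The core idea can be sketched in a few lines. This is a minimal illustration of symmetric 8-bit quantization — one scale factor mapping floats onto the int8 range — not the exact scheme any particular format uses (real formats quantize in small blocks, each with its own scale):

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric 8-bit quantization: map floats to int8 with a single scale."""
    scale = np.abs(weights).max() / 127.0  # largest weight maps to +/-127
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for use at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q.nbytes / w.nbytes)              # 0.25 — int8 is 4x smaller than float32
print(np.abs(w - w_hat).max() <= scale)  # True — rounding error stays within one step
```

The weights come back slightly wrong — that rounding error is the “quality impact” in the table below, and it grows as you shrink the bit width.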
Why Quantize?
A 7B-parameter model in full precision (FP16) requires ~14 GB of memory — 2 bytes per parameter. At lower precision:
| Precision | Memory | Quality Impact |
|---|---|---|
| FP16 (full) | 14 GB | Baseline |
| 8-bit (Q8) | 7 GB | Negligible |
| 4-bit (Q4) | 3.5 GB | Minimal |
| 2-bit (Q2) | 1.75 GB | Noticeable |
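The table follows directly from one formula: parameters × bits per weight, converted to bytes. A quick sketch of that arithmetic for a 7B model:

```python
def model_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory: parameters x bits per weight, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# Reproduce the table for a 7B-parameter model
for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    print(f"{label}: {model_memory_gb(7e9, bits):.2f} GB")
# FP16: 14.00 GB, Q8: 7.00 GB, Q4: 3.50 GB, Q2: 1.75 GB
```

This counts weights only — activations, the KV cache, and runtime overhead add to the real footprint.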
Common Formats
- GGUF: The standard format for Ollama and llama.cpp. Optimized for CPU inference, with Metal acceleration on Apple silicon Macs.
- GPTQ: GPU-focused quantization, popular on NVIDIA hardware.
- AWQ: Activation-aware quantization — uses activation statistics to protect the most important weights, preserving quality better than naive uniform approaches.
Quantization in Practice
When you download a model in Ollama, you’re typically getting a Q4 or Q5 quantized GGUF. This is why a “7B model” downloads as ~4 GB instead of 14 GB — it’s been quantized to fit comfortably in memory while maintaining most of its quality.
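The gap between the table’s 3.5 GB and the ~4 GB you actually download comes from overhead: GGUF Q4 variants store each block of weights with its own scale factor, so the effective cost is a bit more than 4 bits per weight. As a rough back-of-the-envelope check (the ~4.5 effective bits figure is an assumption in line with common Q4 builds, not an exact spec):

```python
params = 7e9
# Effective bits per weight for a typical Q4 GGUF build — an assumed
# round number; per-block scales push it above a flat 4 bits.
effective_bits = 4.5
size_gb = params * effective_bits / 8 / 1e9
print(f"{size_gb:.1f} GB")  # 3.9 GB — close to the ~4 GB download
```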
Elvean brings all these concepts together in one native Mac app — local models, cloud APIs, agentic tools, and more.
Learn more about Elvean