
What Is Model Quantization?

Quantization reduces the numerical precision of a model's parameters, storing weights that were trained as 32-bit or 16-bit floating-point numbers in 8-bit, 4-bit, or even lower-precision formats. This makes models smaller, faster, and runnable on consumer hardware.
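The core idea can be sketched in a few lines of plain Python. This is a minimal illustration of symmetric int8 quantization, not any particular library's implementation: real quantizers operate per-block over large tensors and store many scale factors, but the scale-round-rescale round trip is the same.

```python
def quantize_int8(weights):
    # Pick a scale so the largest-magnitude weight maps to 127,
    # then round every weight to the nearest 8-bit integer step.
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats; rounding error is at most scale / 2.
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight now occupies one byte instead of two or four, at the cost of a small, bounded rounding error per value.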

Why Quantize?

A 7B-parameter model in full precision (FP16) requires roughly 14 GB of memory. Quantized versions shrink dramatically:

Precision      Memory     Quality Impact
FP16 (full)    14 GB      Baseline
8-bit (Q8)     7 GB       Negligible
4-bit (Q4)     3.5 GB     Minimal
2-bit (Q2)     1.75 GB    Noticeable
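The memory column follows from simple arithmetic: parameter count times bits per parameter. A quick sketch (real files run slightly larger, since formats like GGUF also store per-block scale factors and metadata):

```python
PARAMS = 7_000_000_000  # a "7B" model

def model_size_gb(bits_per_param):
    # bits -> bytes (divide by 8), bytes -> GB (divide by 1e9)
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    print(f"{name}: {model_size_gb(bits):.2f} GB")
```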

Common Formats

  • GGUF: The standard format for Ollama and llama.cpp. Optimized for CPU and Metal inference on Mac.
  • GPTQ: GPU-focused quantization, popular on NVIDIA hardware.
  • AWQ: Activation-aware quantization — preserves quality better than naive approaches.

Quantization in Practice

When you download a model in Ollama, you’re typically getting a Q4 or Q5 quantized GGUF. This is why a “7B model” downloads as ~4 GB instead of 14 GB — it’s been quantized to fit comfortably in memory while maintaining most of its quality.

Elvean brings all these concepts together in one native Mac app — local models, cloud APIs, agentic tools, and more.
