What is quantization?
Quantization is a compression technique that reduces the precision of numbers representing the parameters of a language model. It's a bit like going from a RAW image to a JPEG: you lose a bit of quality, but the file becomes much smaller.
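To make the idea concrete, here is a minimal Python sketch (not how llama.cpp stores weights internally, just the principle): map each weight to a small set of integer levels via a scale, then map back and measure what was lost.

```python
import numpy as np

def quantize_dequantize(weights: np.ndarray, bits: int) -> np.ndarray:
    """Round weights to symmetric integer levels with one scale, then map back to floats."""
    levels = 2 ** (bits - 1) - 1                 # 7 levels each side for 4-bit, 127 for 8-bit
    scale = np.abs(weights).max() / levels       # one scale for the whole tensor (deliberately naive)
    codes = np.clip(np.round(weights / scale), -levels, levels)
    return codes * scale                         # the approximation the model actually computes with

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)   # toy stand-in for one weight tensor

for bits in (8, 4):
    w_hat = quantize_dequantize(w, bits)
    rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
    print(f"{bits}-bit round trip, relative error: {rel_err:.3f}")
```

The jump in error between 8 and 4 bits with this naive single-scale approach is exactly why the finer-grained, block-wise schemes described below exist.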
🎯 Why quantize?
- Reduce size: A 70B model goes from 140GB (FP16) to 35GB (Q4) → 4x smaller!
- Accelerate inference: Less data to process = faster generation
- Reduce RAM/VRAM consumption: Allows running large models on modest hardware
- Make local AI accessible: Without quantization, only datacenters could run modern LLMs
Types of quantization
GGUF format and K-quants
GGUF (GPT-Generated Unified Format) is the de facto standard file format created by the llama.cpp project. It supports several quantization schemes, including the K-quants, which use block-wise scaling and mixed bit widths to minimize quality loss.
Relative size of an 8B model by quantization (see the comparison table below for the figures).
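As a rough illustration of the block-wise idea behind these formats, here is a simplified sketch; the real GGUF/K-quant layouts are more elaborate (super-blocks, quantized sub-scales, different bit widths per tensor), but the core trick is the same: split weights into small blocks and give each block its own scale.

```python
import numpy as np

BLOCK = 32  # small groups of weights, each with its own scale

def block_quantize_4bit(weights: np.ndarray):
    """Store 4-bit codes plus one FP16 scale per block of 32 weights.
    Storage cost: 4 bits/weight + 16 bits per 32 weights = ~4.5 bits per parameter."""
    blocks = weights.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7   # symmetric 4-bit range [-7, 7]
    scales[scales == 0] = 1.0                                # guard against all-zero blocks
    codes = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return codes, scales.astype(np.float16)

def block_dequantize(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (codes.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

w = np.random.default_rng(1).normal(0.0, 0.02, size=4096).astype(np.float32)
codes, scales = block_quantize_4bit(w)
print("max abs error:", float(np.abs(w - block_dequantize(codes, scales)).max()))
```

Because each block of 32 weights gets its own scale, outliers in one part of the tensor no longer degrade the precision of everything else.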
Detailed quantization guide
Q4_K_M – The size/quality balance
Bits: ~4.5 bits per parameter (mixed 4 and 6 bits)
Reduction: ~75% compared to FP16
Estimated quality loss: 5-10%
The Q4_K_M quantization (4-bit K-quant, "M" for Medium) is the recommended default choice for most users. It offers the best compromise between reduced size and preserved quality.
- ✅ Ideal for: 8-16GB RAM, general use, chat
- ✅ Advantages: Reasonable size, very acceptable quality
- ⚠️ Limits: Slight loss on complex tasks (advanced math, precise code)
Example 8B: (8×10⁹ × 4.5) / 8 / 1024³ ≈ 4.2 GB
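That back-of-the-envelope formula is easy to wrap in a helper. Note that dividing by 1024³ gives binary gigabytes (GiB), which is why ~4.5 GB of raw bytes shows up as ~4.2 here, while the comparison table below uses the rounder decimal figures.

```python
def approx_size(params: float, bits_per_param: float) -> tuple[float, float]:
    """Return the approximate model size as (decimal GB, binary GiB)."""
    size_bytes = params * bits_per_param / 8
    return size_bytes / 1e9, size_bytes / 1024**3

gb, gib = approx_size(8e9, 4.5)          # 8B parameters at ~4.5 bits each (Q4_K_M)
print(f"~{gb:.1f} GB (decimal) / ~{gib:.1f} GiB (binary)")   # ~4.5 GB / ~4.2 GiB
```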
Q5_K_M – Superior quality
Bits: ~5.5 bits per parameter
Reduction: ~65% compared to FP16
Estimated quality loss: 3-5%
Q5_K_M is the choice of demanding users who have enough RAM. The quality difference compared to the original model is barely noticeable in most use cases.
- ✅ Ideal for: 16-32GB RAM, development, professional writing
- ✅ Advantages: Near-native quality, excellent for code
- ⚠️ Limits: File 25% larger than Q4
Q8_0 – Maximum precision
Bits: 8 bits per parameter
Reduction: 50% compared to FP16
Estimated quality loss: 1-2%
Q8_0 is almost indistinguishable from the original FP16 model. Only very precise benchmarks manage to measure a difference.
- ✅ Ideal for: 32GB+ RAM, professional use, research
- ✅ Advantages: Maximum quality, no compromises
- ⚠️ Limits: Significant size, requires powerful hardware
FP16 / BF16 – Native precision
Bits: 16 bits per parameter
Reduction: None (original format)
Quality loss: 0%
FP16 (half precision) and BF16 (bfloat16) are the formats typically used during training. They are the absolute reference in terms of quality.
- ✅ Ideal for: Datacenters, fine-tuning, academic research
- ⚠️ Inaccessible: A 70B model is 140GB → impossible to run locally for most
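The practical difference between the two 16-bit formats is easy to see if you have PyTorch at hand (assumed here purely for illustration): FP16 has more precision but a narrow range, while BF16 keeps FP32's range at the cost of fewer significant digits.

```python
import torch

x = torch.tensor([0.0001, 3.14159265, 70000.0])

print(x.to(torch.float16))    # 70000 overflows FP16's max (~65504) and becomes inf
print(x.to(torch.bfloat16))   # BF16 keeps the value, but with only ~3 significant digits
```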
Complete comparison table
| Quantization | Bits/param | 8B Size | 70B Size | Quality | Min RAM (8B / 70B) |
|---|---|---|---|---|---|
| FP16 | 16 | 16 GB | 140 GB | 100% (ref) | 24 GB / 144 GB |
| Q8_0 | 8 | 8.5 GB | 74 GB | ~98% | 12 GB / 80 GB |
| Q5_K_M | ~5.5 | 5.5 GB | 48 GB | ~96% | 8 GB / 52 GB |
| Q4_K_M | ~4.5 | 4.5 GB | 39 GB | ~92% | 6 GB / 42 GB |
| Q4_0 | 4 | 4.1 GB | 36 GB | ~88% | 6 GB / 38 GB |
| Q3_K_M | ~3.5 | 3.4 GB | 30 GB | ~82% | 4 GB / 32 GB |
How to choose? Our decision matrix
🎯 Choose according to your configuration: start from your available RAM (the Min RAM column above) and pick the highest-quality quantization that fits.
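Since the recommendation ultimately comes down to "the best quantization your RAM can hold", the whole decision fits in a few lines of code. This is a hypothetical sketch using the 8B Min RAM figures from the table above; a real tool (such as LocalClaw's hardware detector) would also account for VRAM, context length, and the CPU/GPU backend.

```python
# Min RAM thresholds for an 8B model, taken from the comparison table above (GB),
# ordered from highest to lowest quality.
PROFILES_8B = [
    ("FP16",   24),
    ("Q8_0",   12),
    ("Q5_K_M",  8),
    ("Q4_K_M",  6),
    ("Q3_K_M",  4),
]

def recommend_quant(available_ram_gb: float) -> str:
    """Pick the highest-quality quantization whose minimum RAM fits the machine."""
    for name, min_ram in PROFILES_8B:
        if available_ram_gb >= min_ram:
            return name
    return "not enough RAM for an 8B model; try a smaller model"

print(recommend_quant(16))   # -> Q8_0
print(recommend_quant(7))    # -> Q4_K_M
```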
Impact on real performance
Quantization affects not only quality but also generation speed:
- Q4_K_M: ~20-30% faster than FP16 (far less data to move through memory)
- Q5_K_M: ~15-20% faster than FP16
- Q8_0: Speed close to FP16 (the compression is much less aggressive)
On modern GPUs with INT4/INT8 support (RTX 30xx/40xx, Apple Silicon), the speedup can reach 2-3x compared to running FP16 on CPU.
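These speed differences follow from the fact that single-user token generation is mostly limited by memory bandwidth: each new token requires streaming roughly the entire weight file through memory. Here is a crude, heavily simplified estimate, with made-up bandwidth and efficiency numbers used purely for illustration.

```python
def est_tokens_per_sec(model_size_gb: float, bandwidth_gb_s: float, efficiency: float = 0.6) -> float:
    """Rough upper bound: tokens/s ~ usable memory bandwidth / bytes read per token."""
    return efficiency * bandwidth_gb_s / model_size_gb

# Hypothetical ~100 GB/s (e.g. a dual-channel DDR5 desktop); adjust for your hardware.
BANDWIDTH = 100.0
print(f"8B Q4_K_M (~4.5 GB): ~{est_tokens_per_sec(4.5, BANDWIDTH):.0f} tok/s")
print(f"8B FP16   (~16 GB):  ~{est_tokens_per_sec(16.0, BANDWIDTH):.0f} tok/s")
```

The smaller the file, the fewer bytes each token has to pull through memory, which is why the quantized model generates several times faster on the same machine.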
Conclusion
Q4_K_M remains the recommended default choice for most local AI users. It offers 90%+ of the original quality at roughly a quarter of the size, an impressive engineering trade-off.
If you have the necessary RAM, upgrade to Q5_K_M for professional or development use. Reserve Q8_0 for cases where every percentage point of quality counts.
LocalClaw automatically recommends the right quantization based on your detected hardware: no more guessing!