Technical · February 1, 2026

Complete Guide: Q4, Q5, and Q8 Quantization Explained

Which quantization should you choose for your LLMs? Here is the impact on quality, size, and performance, plus everything you need to understand about GGUF and K-quants.

What is quantization?

Quantization is a compression technique that reduces the precision of the numbers representing a language model's parameters. It's a bit like converting a RAW image to a JPEG: you lose some quality, but the file becomes much smaller.
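
To make the analogy concrete, here is a minimal sketch in plain NumPy (not the actual GGUF machinery) that rounds a tensor of float32 weights to 8-bit integers plus a single scale factor, then reconstructs them and measures the error:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=100_000).astype(np.float32)  # toy "layer"

# Symmetric 8-bit quantization: store int8 values plus one float scale.
scale = np.abs(weights).max() / 127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize and compare to the original weights.
restored = q.astype(np.float32) * scale
print("mean abs error:", float(np.abs(weights - restored).mean()))
print("size ratio vs float32:", q.nbytes / weights.nbytes)  # 0.25 (0.5 vs float16)
```

Real GGUF quantization works on small blocks of weights, each with its own scale, which keeps the error much lower than this single global scale, but the principle is the same.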

🎯 Why quantize?

  • Reduce size: A 70B model goes from 140 GB (FP16) to 35 GB (Q4), 4x smaller!
  • Accelerate inference: Less data to process = faster generation
  • Reduce RAM/VRAM consumption: Allows running large models on modest hardware
  • Make local AI accessible: Without quantization, only datacenters could run modern LLMs

Types of quantization

GGUF format and K-quants

GGUF (GPT-Generated Unified Format) is the standard model file format created by the llama.cpp project. It supports several quantization schemes, known as K-quants, which use per-block scales and mixed precision to minimize quality loss.
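
In practice you rarely touch those details: you download a .gguf file at the quantization level you want and load it. Here is a minimal sketch using the llama-cpp-python bindings (the file name is a hypothetical placeholder):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Hypothetical file name: any Q4_K_M / Q5_K_M / Q8_0 GGUF file loads the same way.
llm = Llama(
    model_path="models/llama-3-8b-instruct-Q4_K_M.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```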

Relative size of an 8B model by quantization:

  • FP16 (uncompressed): 16 GB
  • Q8_0 (high quality): 8 GB
  • Q5_K_M (optimal quality): 5 GB
  • Q4_K_M (good balance): 4 GB

Detailed quantization guide

Q4_K_M: the size/quality balance

Bits: ~4.5 bits per parameter (mixed 4 and 6 bits)
Reduction: ~75% compared to FP16
Estimated quality loss: 5-10%

Q4_K_M (the "K-quant, Medium" variant) is the recommended default choice for most users. It offers the best compromise between reduced size and preserved quality.

Memory formula: Q4_Size ≈ (N_params × 4.5) / 8 / 1024³ GB
Example for 8B: (8×10⁹ × 4.5) / 8 / 1024³ ≈ 4.2 GB
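
The same back-of-the-envelope formula as a tiny Python helper (it ignores metadata and per-block scale data, so real files come out slightly larger):

```python
def quantized_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate model size: parameters × bits per weight, converted to GiB."""
    return n_params * bits_per_weight / 8 / 1024**3

print(round(quantized_size_gib(8e9, 4.5), 1))   # ~4.2  (8B at Q4_K_M)
print(round(quantized_size_gib(70e9, 4.5), 1))  # ~36.7 (70B at Q4_K_M)
```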

Q5_K_M: superior quality

Bits: ~5.5 bits per parameter
Reduction: ~65% compared to FP16
Estimated quality loss: 3-5%

Q5_K_M is the choice of demanding users who have enough RAM. The quality difference compared to the original model is barely noticeable in most use cases.

Q8_0: maximum precision

Bits: 8 bits per parameter
Reduction: 50% compared to FP16
Estimated quality loss: 1-2%

Q8_0 is almost indistinguishable from the original FP16 model; only very sensitive benchmarks can measure a difference.

FP16 / BF16: native precision

Bits: 16 bits per parameter
Reduction: None (original format)
Quality loss: 0%

The FP16 (Half Precision) or BF16 (BFloat16) format is the one used during training. It's the absolute reference in terms of quality.
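
The quality percentages above follow directly from how much rounding error each bit width introduces. The sketch below uses plain uniform quantization with a single global scale (K-quants use per-block scales and mixed precision, so their error is lower), but the trend is the same: more bits, less error.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=100_000).astype(np.float32)

def roundtrip_error(w: np.ndarray, bits: int) -> float:
    """Quantize symmetrically to `bits` with one global scale, return mean abs error."""
    levels = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits, 7 for 4 bits
    scale = np.abs(w).max() / levels
    q = np.clip(np.round(w / scale), -levels, levels)
    return float(np.abs(w - q * scale).mean())

for bits in (3, 4, 5, 8):
    print(f"{bits}-bit: mean abs error = {roundtrip_error(weights, bits):.6f}")
```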

Complete comparison table

Quantization   Bits/param   8B size   70B size   Quality      Min RAM (8B / 70B)
FP16           16           16 GB     140 GB     100% (ref)   24 GB / 144 GB
Q8_0           8            8.5 GB    74 GB      ~98%         12 GB / 80 GB
Q5_K_M         ~5.5         5.5 GB    48 GB      ~96%         8 GB / 52 GB
Q4_K_M         ~4.5         4.5 GB    39 GB      ~92%         6 GB / 42 GB
Q4_0           4            4.1 GB    36 GB      ~88%         6 GB / 38 GB
Q3_K_M         ~3.5         3.4 GB    30 GB      ~82%         4 GB / 32 GB
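
You can get within rounding distance of these figures yourself with the formula from earlier. The sketch below computes sizes in GiB, while the table rounds in decimal GB, and real GGUF files are a bit larger because of per-block scales and file metadata, so treat the output as a ballpark check:

```python
QUANT_BITS = {"FP16": 16, "Q8_0": 8, "Q5_K_M": 5.5, "Q4_K_M": 4.5, "Q4_0": 4, "Q3_K_M": 3.5}

def size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate model size in GiB from parameter count and bits per weight."""
    return n_params * bits_per_weight / 8 / 1024**3

for name, bits in QUANT_BITS.items():
    print(f"{name:8s} 8B ≈ {size_gib(8e9, bits):5.1f} GiB   70B ≈ {size_gib(70e9, bits):6.1f} GiB")
```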

How to choose? Our decision matrix

🎯 Choose according to your configuration (a quick code sketch of this rule of thumb follows the list):

8 GB RAM → Q4_K_M mandatory, 7-8B models max
16 GB RAM → Q5_K_M recommended for 8B, Q4_K_M for 13-14B
32 GB RAM → Q5_K_M for 32B, Q8_0 possible for 8B
64 GB+ RAM → Q8_0 for 32-70B, maximum quality
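
As a sketch only (this is a hypothetical helper mirroring the matrix above, not LocalClaw's actual logic), the whole decision fits in a few lines:

```python
def recommend_quant(ram_gb: int, model_params_b: float) -> str:
    """Rule-of-thumb quantization pick from the decision matrix above."""
    if ram_gb >= 64:
        return "Q8_0"                                    # maximum quality up to ~70B
    if ram_gb >= 32:
        return "Q8_0" if model_params_b <= 8 else "Q5_K_M"
    if ram_gb >= 16:
        return "Q5_K_M" if model_params_b <= 8 else "Q4_K_M"
    return "Q4_K_M"                                      # 8 GB RAM: stick to 7-8B models

print(recommend_quant(16, 8))    # Q5_K_M
print(recommend_quant(32, 32))   # Q5_K_M
print(recommend_quant(8, 7))     # Q4_K_M
```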

Impact on real performance

Quantization affects not only quality but also generation speed:

On modern hardware with efficient low-precision support (RTX 30xx/40xx GPUs, Apple Silicon), quantized inference can run 2-3x faster than FP16, largely because far less data has to be read from memory for each generated token.
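
If you want to see the effect on your own machine, here is a rough timing sketch with llama-cpp-python (the file names are hypothetical placeholders; swap in GGUF files you actually have):

```python
import time
from llama_cpp import Llama

# Hypothetical placeholders: point these at GGUF files present on your disk.
MODELS = {
    "Q4_K_M": "models/llama-3-8b-instruct-Q4_K_M.gguf",
    "Q8_0":   "models/llama-3-8b-instruct-Q8_0.gguf",
}
PROMPT = "Write a short paragraph about the ocean."

for name, path in MODELS.items():
    llm = Llama(model_path=path, n_ctx=2048, n_gpu_layers=-1, verbose=False)
    start = time.perf_counter()
    out = llm(PROMPT, max_tokens=128)
    elapsed = time.perf_counter() - start
    tokens = out["usage"]["completion_tokens"]  # OpenAI-style usage counts
    print(f"{name}: {tokens / elapsed:.1f} tokens/s")
```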

Conclusion

Q4_K_M remains the recommended default choice for most local AI users. It delivers 90%+ of the original quality at roughly a quarter of the size, an impressive engineering trade-off.

If you have the necessary RAM, upgrade to Q5_K_M for professional or development use. Reserve Q8_0 for cases where every percentage of quality counts.

LocalClaw automatically recommends the right quantization based on your detected hardware, no more guessing!