
Quantization Explained: Q4 vs Q5 vs Q8

The complete guide to LLM compression. Which quantization gives you the best speed/quality trade-off?

TL;DR: Q5_K_M is the sweet spot for most users — 70% smaller than the original model with minimal quality loss. Use Q4_K_M if you're tight on RAM. Only use Q8_0 if you need maximum quality and have abundant memory.

What Is Quantization?

Quantization is compression for AI models. Think of it like JPEG for images or MP3 for audio — it reduces file size while (mostly) preserving quality. For LLMs, this means fitting larger, smarter models into limited RAM or VRAM.

Without quantization, a 70-billion parameter model requires 140 GB of memory at full precision. That's impossible for consumer hardware. With Q4 quantization, the same model fits in under 40 GB — accessible to high-end consumer machines.

The Quantization Levels Explained

Format       Bits per Parameter   Size Reduction    Quality
FP16 (Full)  16 bits              0% (baseline)     ★★★★★ Perfect
Q8_0         8 bits               ~50%              ★★★★☆ Excellent
Q6_K         6 bits               ~62%              ★★★★☆ Very Good
Q5_K_M ⭐    5 bits               ~70%              ★★★★☆ Best Balance
Q4_K_M       4 bits               ~75%              ★★★☆☆ Good
Q3_K_M       3 bits               ~82%              ★★☆☆☆ Acceptable
Q2_K         2 bits               ~87%              ★☆☆☆☆ Poor
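
To make the table concrete, here is a minimal back-of-the-envelope estimator in Python. It's a sketch that assumes the nominal bits-per-parameter above; real GGUF files come out a little larger because every block of weights also stores its own scale metadata.

```python
# Rough weight-storage estimate: parameters x bits per parameter / 8 bytes.
NOMINAL_BITS = {
    "FP16":   16,
    "Q8_0":    8,
    "Q6_K":    6,
    "Q5_K_M":  5,
    "Q4_K_M":  4,
    "Q3_K_M":  3,
    "Q2_K":    2,
}

def estimate_size_gb(params_billions: float, fmt: str) -> float:
    """Approximate on-disk size in GB for a model at the given quantization."""
    return params_billions * NOMINAL_BITS[fmt] / 8

for fmt, bits in NOMINAL_BITS.items():
    size = estimate_size_gb(70, fmt)
    saved = 1 - size / estimate_size_gb(70, "FP16")
    print(f"{fmt:7s} {bits:2d} bits  ~{size:5.1f} GB for a 70B model  ({saved:.0%} smaller)")
```

For a 70B model this reproduces the 140 GB FP16 figure from earlier and lands within a few gigabytes of the Llama 3.3 file sizes shown later in this post.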

What Do the Letters Mean?

K

K-quants, a newer block-wise quantization scheme that groups weights into small blocks with shared scale factors. Generally higher quality than the legacy formats at the same bit count.

_M / _S

Medium or Small mix: which tensors get extra bits. _M is the balanced choice; _S trades a little more quality for a smaller file.

_0

Legacy format. Simpler but less efficient. Still used for Q8 for maximum compatibility.
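
In practice these suffixes show up directly in GGUF filenames on model hubs, in names like Llama-3.3-70B-Instruct-Q5_K_M.gguf. Here's a small sketch for pulling the tag out of a filename; the filenames below are made up for illustration.

```python
import re

# Matches the common "Q<bits>[_K][_S|_M|_L|_0]" suffix convention, plus FP16/F16.
QUANT_RE = re.compile(r"(Q\d(?:_K)?(?:_[SML0])?|F(?:P)?16)", re.IGNORECASE)

def quant_of(filename: str):
    """Return the quantization tag embedded in a GGUF filename, or None."""
    match = QUANT_RE.search(filename)
    return match.group(1).upper() if match else None

print(quant_of("Llama-3.3-70B-Instruct-Q5_K_M.gguf"))   # Q5_K_M
print(quant_of("llama-3.3-70b-instruct-q4_k_m.gguf"))   # Q4_K_M
print(quant_of("some-model-Q8_0.gguf"))                 # Q8_0
```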

Which Should You Choose?

🎯 Q5_K_M — The Sweet Spot (Recommended)

Best balance of quality and size. You lose ~5-10% quality compared to full precision, but the model is 70% smaller.

  • Best for: General chat, coding, writing
  • RAM needed: Model file size + 2-3 GB of overhead
  • Speed: Fast loading, good inference

💾 Q4_K_M — When RAM Is Tight

Significant compression with acceptable quality loss. Good for running larger models on limited hardware.

  • Best for: 8-16 GB RAM systems, or running 70B models with partial GPU offload on 24 GB of VRAM
  • Quality impact: ~15-20% reduction
  • Trade-off: Size vs. reasoning quality

🏆 Q8_0 — Maximum Quality

Closest to the original model. Minimal quality loss but requires twice the RAM of Q4.

  • Best for: Critical tasks, production use, research
  • RAM needed: 2x Q4 size
  • When: You have abundant memory (32GB+)

⚠️ Avoid Q2 and Q3

These are too aggressive. You get tiny files but the model becomes noticeably "dumber" — poor reasoning, repetitive outputs, and incoherent responses at longer contexts.

Real-World Example: Llama 3.3 70B

Let's look at how quantization affects one of the best open-source models:

Format    File Size   RAM Required   Use Case
FP16      140 GB      143+ GB        Enterprise only
Q8_0      74 GB       77+ GB         High-end workstations
Q5_K_M    43 GB       46+ GB         Mac Studio, high-end PCs
Q4_K_M    38 GB       41+ GB         64 GB MacBook Pro

Key Insight: A $4,000 Mac Studio with 64GB unified memory can run Q5_K_M Llama 3.3 70B at near-GPT-4 quality, completely offline. That's the power of quantization.
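
If you want to try this yourself, here's a minimal sketch using the llama-cpp-python bindings. The model path is a placeholder: point it at whichever Q5_K_M GGUF you've actually downloaded.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path to a Q5_K_M GGUF on disk.
MODEL_PATH = "models/Llama-3.3-70B-Instruct-Q5_K_M.gguf"

llm = Llama(
    model_path=MODEL_PATH,
    n_ctx=4096,       # context window; longer contexts need more memory
    n_gpu_layers=-1,  # offload all layers to the GPU / Apple Metal backend
)

reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
    max_tokens=128,
)
print(reply["choices"][0]["message"]["content"])
```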

Performance Impact: Does Quantization Slow Things Down?

Surprisingly, lower quantization often runs faster. Here's why:

  • Memory bandwidth: Token generation is memory-bound, so smaller weights mean fewer bytes to stream per token
  • Cache efficiency: 4-bit weights fit better in CPU/GPU caches
  • Batch processing: The memory you save leaves room for larger batches and longer contexts

On Apple Silicon, Q4 models often achieve higher tokens/second than Q8 because the unified memory architecture favors smaller memory footprints. On NVIDIA GPUs with limited VRAM, Q4 enables larger batch sizes.
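
An easy way to sanity-check this on your own hardware is to time the same prompt against two quantizations of the same model. A rough sketch follows; the file paths are placeholders and the numbers will vary a lot from machine to machine.

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

def tokens_per_second(model_path: str, prompt: str, max_tokens: int = 128) -> float:
    """Crude throughput check: completion tokens divided by wall-clock seconds."""
    llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=-1, verbose=False)
    start = time.perf_counter()
    out = llm(prompt, max_tokens=max_tokens)
    elapsed = time.perf_counter() - start
    return out["usage"]["completion_tokens"] / elapsed

# Placeholder paths: point these at Q4_K_M and Q8_0 builds of the same model.
for path in ("models/model-Q4_K_M.gguf", "models/model-Q8_0.gguf"):
    print(path, f"{tokens_per_second(path, 'Write a haiku about RAM.'):.1f} tok/s")
```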

How to Choose: Quick Decision Tree

Step 1: Check your available RAM (System RAM for CPU, VRAM for GPU)

Step 2: Look up the model sizes at different quantizations

Step 3: Use this rule of thumb (a code sketch follows the list):

  • If your RAM comfortably exceeds the Q5_K_M file size (plus a few GB of overhead) → Try Q5_K_M
  • If that's tight → Use Q4_K_M
  • If you need maximum quality and can fit roughly twice the Q4 size → Use Q8_0
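
Here's that rule as a small Python function, assuming (as in the tables above) that the Q5_K_M file is roughly 1.15x the Q4_K_M size and Q8_0 roughly 2x; the function name and thresholds are illustrative, not hard limits.

```python
def recommend_quant(ram_gb: float, q4_size_gb: float,
                    need_max_quality: bool = False) -> str:
    """Pick a quantization level from available memory and the Q4_K_M file size."""
    q5_size_gb = q4_size_gb * 1.15  # assumed ratio, matches the Llama 3.3 table
    q8_size_gb = q4_size_gb * 2.0   # Q8 is roughly twice the Q4 size
    overhead_gb = 3                 # context, KV cache, runtime buffers

    if need_max_quality and ram_gb >= q8_size_gb + overhead_gb:
        return "Q8_0"
    if ram_gb >= q5_size_gb + overhead_gb:
        return "Q5_K_M"
    if ram_gb >= q4_size_gb + overhead_gb:
        return "Q4_K_M"
    return "Model too large for this machine; pick a smaller model"

# Llama 3.3 70B (Q4_K_M is ~38 GB per the table above) on a 64 GB machine:
print(recommend_quant(64, q4_size_gb=38))  # -> Q5_K_M
```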

Or just use our LocalClaw configurator — we automatically recommend the optimal quantization based on your detected hardware.

Bottom line: Start with Q5_K_M for everything. Only go lower if you must, and only go higher if you have RAM to spare.