TL;DR: Q5_K_M is the sweet spot for most users — 70% smaller than the original model with minimal quality loss. Use Q4_K_M if you're tight on RAM. Only use Q8_0 if you need maximum quality and have abundant memory.
What Is Quantization?
Quantization is compression for AI models. Think of it like JPEG for images or MP3 for audio — it reduces file size while (mostly) preserving quality. For LLMs, this means fitting larger, smarter models into limited RAM or VRAM.
Without quantization, a 70-billion-parameter model needs about 140 GB of memory at full 16-bit precision (70 billion parameters × 2 bytes each). That's out of reach for consumer hardware. With 4-bit quantization, the same model fits in roughly 40 GB, which puts it within reach of high-end consumer machines.
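You can sanity-check those numbers yourself. Here's a minimal sketch of the arithmetic in plain Python; real GGUF files land a little higher because quantized formats store per-block scales and keep a few tensors at higher precision:

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage: parameter count x bits per weight, in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"70B at FP16:  ~{model_size_gb(70, 16):.0f} GB")   # ~140 GB
print(f"70B at 4-bit: ~{model_size_gb(70, 4.5):.0f} GB")  # ~39 GB at an effective 4.5 bpw
```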
The Quantization Levels Explained
| Format | Bits per Parameter | Size Reduction | Quality |
|---|---|---|---|
| FP16 (Full) | 16 bits | 0% (baseline) | ★★★★★ Perfect |
| Q8_0 | 8 bits | ~50% reduction | ★★★★☆ Excellent |
| Q6_K | 6 bits | ~62% reduction | ★★★★☆ Very Good |
| Q5_K_M ⭐ | 5 bits | ~70% reduction | ★★★★☆ Best Balance |
| Q4_K_M | 4 bits | ~75% reduction | ★★★☆☆ Good |
| Q3_K_M | 3 bits | ~82% reduction | ★★☆☆☆ Acceptable |
| Q2_K | 2 bits | ~87% reduction | ★☆☆☆☆ Poor |
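The "Size Reduction" column is just the bit ratio, lightly rounded: reduction = 1 - (bits ÷ 16). A quick sketch that reproduces it (actual files run slightly higher, since K-quants spend extra bits on per-block scales):

```python
# Size reduction relative to FP16 follows directly from bits per weight.
for name, bits in [("Q8_0", 8), ("Q6_K", 6), ("Q5_K_M", 5),
                   ("Q4_K_M", 4), ("Q3_K_M", 3), ("Q2_K", 2)]:
    print(f"{name}: ~{1 - bits / 16:.0%} smaller than FP16")
```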
What Do the Letters Mean?
- **K**: K-quants, llama.cpp's newer block-wise quantization scheme. Weights are grouped into small blocks that share quantized scale and minimum values, which generally gives higher quality than the legacy formats.
- **_M / _S**: Medium or Small mix. Different quantization mixes across the model's tensors; _M is the balanced choice, while _S trades some quality for a smaller file.
- **_0**: Legacy format. Simpler but less efficient. Still used for Q8, where 8 bits leave little quality on the table and compatibility matters.
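To make "block-wise scales" concrete, here's a toy 4-bit block quantizer in plain Python. It's a deliberate simplification: real llama.cpp K-quants pack weights into super-blocks of 256 with quantized scales and minimums. But the core trick is the same, one shared scale and minimum per small block of weights:

```python
def quantize_block_4bit(block: list[float]):
    """Store each weight as a 4-bit code (0..15) plus one scale and one
    minimum per block; reconstruct as code * scale + minimum."""
    lo, hi = min(block), max(block)
    scale = (hi - lo) / 15 or 1.0  # 15 steps span the block's value range
    codes = [round((w - lo) / scale) for w in block]
    return codes, scale, lo

def dequantize_block(codes, scale, minimum):
    return [c * scale + minimum for c in codes]

weights = [0.12, -0.40, 0.33, 0.05, -0.21, 0.47, -0.09, 0.28]
codes, scale, lo = quantize_block_4bit(weights)
print(dequantize_block(codes, scale, lo))  # close to the original weights
```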
Which Should You Choose?
🎯 Q5_K_M — The Sweet Spot (Recommended)
Best balance of quality and size. You lose ~5-10% quality compared to full precision, but the model is 70% smaller.
- Best for: General chat, coding, writing
- RAM needed: Model file size + 2-3 GB overhead
- Speed: Fast loading, good inference
💾 Q4_K_M — When RAM Is Tight
Significant compression with acceptable quality loss. Good for running larger models on limited hardware.
- Best for: 8-16 GB RAM systems, or splitting a 70B model between a 24 GB GPU and system RAM
- Quality impact: ~15-20% reduction
- Trade-off: Size vs. reasoning quality
🏆 Q8_0 — Maximum Quality
Closest to the original model. Minimal quality loss but requires twice the RAM of Q4.
- Best for: Critical tasks, production use, research
- RAM needed: ~2x the Q4_K_M file size
- When: You have abundant memory (32GB+)
⚠️ Avoid Q2 and Q3
These are too aggressive. You get tiny files but the model becomes noticeably "dumber" — poor reasoning, repetitive outputs, and incoherent responses at longer contexts.
Real-World Example: Llama 3.3 70B
Let's look at how quantization affects one of the best open-source models:
| Format | File Size | RAM Required | Use Case |
|---|---|---|---|
| FP16 | 140 GB | 143+ GB | Enterprise only |
| Q8_0 | 74 GB | 77+ GB | High-end workstations |
| Q5_K_M | 50 GB | 53+ GB | Mac Studio, high-end PCs |
| Q4_K_M | 43 GB | 46+ GB | 64GB MacBook Pro |
Key Insight: A $4,000 Mac Studio with 64GB unified memory can run Q5_K_M Llama 3.3 70B at near-GPT-4 quality, completely offline. That's the power of quantization.
Performance Impact: Does Quantization Slow Things Down?
Surprisingly, more aggressive (lower-bit) quantization often runs faster. Here's why:
- Memory bandwidth: Token generation is usually bandwidth-bound, and smaller weights mean fewer bytes streamed from RAM/VRAM per token
- Cache efficiency: 4-bit weights fit better in CPU/GPU caches
- Batch headroom: Memory saved on weights can go toward larger batches or longer contexts
On Apple Silicon, Q4 models often achieve higher tokens/second than Q8 because the unified memory architecture favors smaller memory footprints. On NVIDIA GPUs with limited VRAM, Q4 enables larger batch sizes.
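A back-of-the-envelope way to see the bandwidth point: during single-stream generation, each new token requires streaming (roughly) every weight from memory once, so decode speed is capped at bandwidth divided by model size. The 400 GB/s figure below is an illustrative assumption, not a benchmark:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/second when decoding is memory-bandwidth-bound:
    every generated token streams the full weight file once."""
    return bandwidth_gb_s / model_gb

print(decode_ceiling_tok_s(400, 74))  # 70B Q8_0:   ~5.4 tok/s ceiling
print(decode_ceiling_tok_s(400, 43))  # 70B Q4_K_M: ~9.3 tok/s ceiling
```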
How to Choose: Quick Decision Tree
Step 1: Check your available RAM (System RAM for CPU, VRAM for GPU)
Step 2: Look up the model sizes at different quantizations
Step 3: Use this rule (see the sketch after this list):
- If RAM ≥ 2x the quantized file size → try Q5_K_M
- If RAM is tight → use Q4_K_M
- If you need maximum quality → use Q8_0
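Here's that rule as a small sketch. The function name and the 3 GB overhead allowance are our illustrative choices; the file sizes come from the model's download page:

```python
def recommend_quant(ram_gb: float, q8_gb: float, q5_gb: float, q4_gb: float,
                    need_max_quality: bool = False) -> str:
    """Apply the rule above, leaving ~3 GB for KV cache and runtime overhead."""
    overhead = 3
    if need_max_quality and ram_gb >= q8_gb + overhead:
        return "Q8_0"
    if ram_gb >= 2 * q5_gb:
        return "Q5_K_M"
    if ram_gb >= q4_gb + overhead:
        return "Q4_K_M"
    return "too large: pick a smaller model"

# Llama 3.3 70B on a 64 GB machine, using the table above:
print(recommend_quant(64, q8_gb=74, q5_gb=50, q4_gb=43))  # -> Q4_K_M
# The conservative 2x rule picks Q4_K_M here; Q5_K_M also fits at ~53 GB,
# just with less headroom.
```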
Or just use our LocalClaw configurator — we automatically recommend the optimal quantization based on your detected hardware.
Bottom line: Start with Q5_K_M for everything. Only go lower if you must, and only go higher if you have RAM to spare.