A static, Google-indexable guide to the best local AI models that fit in a 16GB RAM budget. Built from the LocalClaw model database and ranked by quality, reasoning, coding and speed.
With 16GB of RAM, prioritize models whose minimum RAM requirement is at or below 16GB, and leave headroom for the OS and the context (KV) cache rather than filling memory completely. For most users, start with GLM 4.5 Air (MoE), then test a smaller, faster model if latency matters.
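A quick way to sanity-check that budget before downloading anything: a quantized model's weights take roughly parameter count times bits per weight divided by eight, plus headroom for the KV cache and runtime. A minimal sketch, where the bits-per-weight figure and the overhead constant are assumptions rather than measurements:

```python
def estimate_ram_gb(params_billion: float, bits_per_weight: float,
                    overhead_gb: float = 2.0) -> float:
    """Very rough RAM estimate for a quantized model.

    params_billion  -- total parameter count, in billions
    bits_per_weight -- effective bits per weight, e.g. ~4.5 for a Q4_K_M-style quant (assumption)
    overhead_gb     -- KV cache + runtime buffers; a guess, not a measurement
    """
    weights_gb = params_billion * bits_per_weight / 8  # billions of weights * bytes/weight ~= GB
    return weights_gb + overhead_gb


# Two dense examples against a 16 GB budget:
print(f"8B  @ ~4.5 bpw: {estimate_ram_gb(8, 4.5):.1f} GB")   # ~6.5 GB, lots of headroom
print(f"14B @ ~4.5 bpw: {estimate_ram_gb(14, 4.5):.1f} GB")  # ~9.9 GB, still fits comfortably
```

Treat this only as a first-pass filter; the per-model minimum-RAM figures in the listing are the numbers to trust.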
Zhipu AI's efficient MoE powerhouse: 106B total parameters with only 14B active at inference, giving the speed of a small dense model with the quality of a much larger one. Clearly the best in the 16–24GB RAM range. Outperforms Llama 3.3 70B. Apache 2.0.
The sweet spot. Incredible reasoning, coding and chat quality. The best model you can run on 16GB.
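The MoE trade-off behind that claim is easy to put into rough numbers: per-token compute scales with the active parameters, while the memory footprint scales with the total parameters. A back-of-the-envelope sketch using the figures quoted above and the common ~2 FLOPs-per-parameter-per-token approximation:

```python
def flops_per_token(active_params_billion: float) -> float:
    # Common decode-time approximation: ~2 FLOPs per active parameter per token.
    return 2 * active_params_billion * 1e9


dense_70b = flops_per_token(70)   # a dense 70B model keeps every parameter active
glm_air = flops_per_token(14)     # GLM 4.5 Air routes ~14B of its 106B params per token

print(f"Dense 70B  : {dense_70b:.2e} FLOPs/token")
print(f"GLM 4.5 Air: {glm_air:.2e} FLOPs/token "
      f"(~{dense_70b / glm_air:.0f}x less compute per token)")
# Memory is the catch: all 106B parameters still need to be resident or streamed,
# which is why the per-model minimum-RAM figures matter more than active-parameter counts.
```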
ServiceNow x NVIDIA mid-size reasoner. Half the memory of 32B reasoners with comparable performance on MBPP, BFCL, GPQA. Strong enterprise fit. MIT licensed.
Zhipu AI lightweight flagship. Strong bilingual CN/EN with hybrid thinking mode, 200K context and tool calling. Apache 2.0 — excellent alternative to Qwen 3.5 9B on modest GPUs.
Microsoft Phi-4 reasoning variant. Top choice for 14B reasoning — much better than DeepSeek R1 14B. Rivals larger models on math & logic.
NVIDIA hybrid Mamba-Transformer 9B. 6x throughput vs comparable dense models, 128K context, strong maths/code. Efficient toggle-able reasoning. NVIDIA Open Model License.
Updated R1 reasoning distilled to Qwen3-8B. Improved chain-of-thought with fewer hallucinations vs original R1 distills. MIT licensed.
Qwen 3 vision-language model. Strong OCR, document understanding, chart & UI reasoning. 128K context with native image+video inputs. Apache 2.0.
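If you serve the model behind a local OpenAI-compatible endpoint (llama.cpp's server, LM Studio and others expose one), a document-understanding request looks roughly like the sketch below. The port, the model name, and whether your runtime accepts image inputs for this checkpoint are assumptions, so adjust to your setup:

```python
import base64

from openai import OpenAI

# Assumption: a local OpenAI-compatible server on port 8080 serving the VL model.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="qwen3-vl",  # placeholder name; depends on how your server registers it
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the total amount and the invoice date."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```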
Alibaba's hybrid-thinking micro-flagship. Toggles between instant answers and deep chain-of-thought reasoning on demand. 128K context, 29 languages, outperforms Qwen3-8B on reasoning benchmarks. Apache 2.0.
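The Qwen3 model cards document the hybrid-thinking toggle as an `enable_thinking` flag on the Hugging Face transformers chat template; the checkpoint name below, and the flag itself if you run a different build or runtime, are assumptions. A minimal sketch:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-4B"  # assumed checkpoint; other Qwen3 hybrid-thinking models work the same way
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Is 9.11 greater than 9.9? Answer briefly."}]

# enable_thinking=True makes the model emit a <think>...</think> block before answering;
# enable_thinking=False skips the chain-of-thought for instant replies.
prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```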
Microsoft's full Phi-4. Compact powerhouse with exceptional reasoning and coding for its size. MIT licensed.
⭐ Mac Mini M4 16GB top pick! NVIDIA fine-tune of Llama 3.1. Hybrid /think • /no_think mode — deep reasoning on demand, instant chat otherwise. ~80–120 tok/s on Apple Silicon Metal. 128K context. Apache 2.0.
One of the best 8B models ever made. Thinking mode + lightning fast. The new king of 8B.
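A sketch of the /think • /no_think toggle mentioned above, sent over a local OpenAI-compatible endpoint (Ollama and llama.cpp's server both expose one). The port, the registered model name, and the exact switch phrasing are taken from the description above or assumed, so check your runtime's model card:

```python
from openai import OpenAI

# Assumption: Ollama's OpenAI-compatible endpoint on its default port.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")


def ask(question: str, deep: bool) -> str:
    # Per the card summarized above: "/think" requests step-by-step reasoning,
    # "/no_think" asks for a direct, low-latency answer.
    resp = client.chat.completions.create(
        model="nemotron-nano",  # placeholder tag; use whatever name your runtime registered
        messages=[
            {"role": "system", "content": "/think" if deep else "/no_think"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content


print(ask("Outline a 3-step strategy to debug a flaky test.", deep=True))
print(ask("What is the capital of France?", deep=False))
```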
o3-mini-level open coder. Strong reasoning + coding combo. 326K downloads.
OpenAI open-weight reasoning model, the company's first open-weight release since GPT-2. Strong general + coding capabilities. 3.4M downloads.
Perplexity's debiased version of DeepSeek R1, post-trained to keep R1's reasoning while removing its biases. 115K downloads.
⭐ Mac Mini M4 16GB top pick! NVIDIA's hybrid model — distilled from 9B, keeps 95% of its quality. Hybrid attention + SSM layers = ~80–120 tok/s on Apple Silicon. Blazing fast, minimal RAM. NVIDIA Open Model License.
Strong 14B from Alibaba, trained on 18T tokens. Excellent for multilingual tasks and coding.
Meta's multimodal MoE model. 17B active params across 16 experts (~109B total). Built-in image understanding. 10M token context window. Llama 4 Community License. 728K downloads.