Llama-3.1-Nemotron-Nano (4B)
⭐ Mac Mini M4 16GB top pick! NVIDIA fine-tune of Llama 3.1. Hybrid /think • /no_think mode: deep reasoning on demand, instant chat otherwise. ~80–120 tok/s on Apple Silicon Metal. 128K context. NVIDIA Open Model License.
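To try the toggle locally, here is a minimal sketch assuming an OpenAI-compatible server such as llama.cpp's llama-server running on localhost; the port, GGUF file name and model alias are placeholders, not LocalClaw data:

```python
# Minimal sketch, assuming a local OpenAI-compatible endpoint, e.g.:
#   llama-server -m nemotron-nano-4b.Q4_K_M.gguf --port 8080
# (file name, port and model alias are assumptions)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def ask(prompt: str, think: bool) -> str:
    # /think and /no_think are soft switches read by the model's chat
    # template: with /think the model emits a reasoning trace before the
    # answer, with /no_think it replies immediately.
    switch = "/think" if think else "/no_think"
    resp = client.chat.completions.create(
        model="nemotron-nano-4b",  # server-side alias (assumption)
        messages=[{"role": "user", "content": f"{prompt} {switch}"}],
    )
    return resp.choices[0].message.content

print(ask("How many primes are below 100?", think=True))  # deep reasoning
print(ask("Say hi in one sentence.", think=False))        # instant chat
```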
The best small, fast local LLMs for low-latency chat on laptops, edge machines and 8–16 GB RAM setups. Ranked from the LocalClaw model database, with RAM requirements, quantization notes and links to static model pages.
For a fast, small model, start with Llama-3.1-Nemotron-Nano (4B) if your hardware fits it. Otherwise, choose the highest-ranked model that fits your RAM tier and preferred quantization; a rough fit check is sketched below.
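As a rough guide to "fits your RAM tier": weight memory is roughly parameter count times bits per weight, plus overhead for KV cache and runtime. The quant bit-widths and the 1.2x overhead factor below are ballpark assumptions, not LocalClaw's figures:

```python
# Back-of-the-envelope fit check: weights ≈ params × bits_per_weight / 8,
# times an overhead factor for KV cache, activations and runtime.
# The bit-widths and the 1.2x overhead are ballpark assumptions.
BITS_PER_WEIGHT = {"Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q8_0": 8.5, "F16": 16.0}

def fits_in_ram(params_b: float, quant: str, ram_gb: float,
                overhead: float = 1.2) -> bool:
    weights_gb = params_b * BITS_PER_WEIGHT[quant] / 8  # params in billions
    return weights_gb * overhead <= ram_gb

print(fits_in_ram(4, "Q4_K_M", 16))  # True: the 4B pick needs under 3 GB
print(fits_in_ram(14, "Q4_K_M", 8))  # False: a 14B overflows an 8 GB tier
```

At Q4_K_M the 4B top pick needs under 3 GB for weights, which is why it sits so comfortably on a 16 GB Mac Mini M4.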
⭐ Mac Mini M4 16GB top pick! NVIDIA's hybrid model — distilled from 9B, keeps 95% of its quality. Hybrid attention + SSM layers = ~80–120 tok/s on Apple Silicon. Blazing fast, minimal RAM. NVIDIA Open Model License.
MoE gem: only 3B params active at inference. 19x faster than Qwen3-Max at 256K context. Best quality-per-watt of the series. Hybrid thinking mode. Runs on Mac Studio 32GB. Agentic coding standout (see the MoE sketch after this list).
Alibaba's hybrid-thinking micro-flagship. Toggles between instant answers and deep chain-of-thought reasoning on demand. 128K context, 29 languages, outperforms Qwen3-8B on reasoning benchmarks. Apache 2.0.
IBM ultra-efficient 2B. Best-in-class among small models for tool calling & structured output (see the tool-calling sketch after this list). Perfect for on-device RAG and agents. 128K context. Apache 2.0.
InclusionAI's MIT-licensed instruct MoE optimized for fast agent workloads. 104B total parameters, only 7.4B active, hybrid linear attention, 262K context and strong tool-use / multi-step execution with high token efficiency.
Ultra-compact Qwen 3.5 with hybrid thinking mode and 256K context. Runs comfortably on 4 GB RAM — ideal for MacBook Air M1/M2, Windows laptops, and edge devices. Apache 2.0.
Tiny model beating o1-preview on math! Incredible reasoning-to-size ratio. 474K downloads.
Zhipu AI's fast GLM model. 14B parameters optimized for quick responses with strong bilingual (CN/EN) capabilities. Efficient inference for everyday tasks. Apache 2.0.
Sweet-spot small model. Surprisingly capable for its size with hybrid thinking, 256K context and strong multilingual support. Runs on 8 GB RAM. The go-to for MacBook Air M4 16 GB. Apache 2.0.
Gemma 4 compact multimodal model for on-device usage. Supports text, image, audio, and video understanding with 256K context. Apache 2.0.
Alibaba's think-then-answer model. Built-in chain-of-thought reasoning at just 4B params.
Phi-4 mini balanced reasoning variant. Good chain-of-thought at tiny size. 62K downloads.
Microsoft's latest small miracle. Punches way above its weight in reasoning & code.
The tiniest Qwen 3.5 — runs on any device including smartphones and Raspberry Pi. Hybrid thinking mode, 256K context, Apache 2.0. Perfect for on-device AI and embedded apps.
Microsoft's lightweight powerhouse. Punches way above its weight. 11.3M downloads. Great for edge devices.
Phi 3.5 outperforms many larger models. Excellent small model from Microsoft. 333K downloads.
IBM vision model for document extraction. Tiny but effective at understanding documents. 365K downloads.
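For the MoE entries above, the rule of thumb is that total parameters set the RAM floor (every expert must be resident) while active parameters set per-token compute. A back-of-the-envelope sketch, assuming ~4.5 bits per weight at Q4_K_M:

```python
# MoE rule of thumb: ALL experts must sit in RAM (total params drive
# memory), but each token only touches the active subset (active params
# drive speed). The quant bit-width and compute ratio are rough assumptions.
def moe_profile(total_b: float, active_b: float, quant_bits: float = 4.5):
    weights_gb = total_b * quant_bits / 8   # memory: total params
    compute_ratio = active_b / total_b      # speed: active fraction
    return weights_gb, compute_ratio

for name, total, active in [("30B-A3B MoE", 30, 3),
                            ("104B-A7.4B MoE", 104, 7.4)]:
    gb, ratio = moe_profile(total, active)
    print(f"{name}: ~{gb:.0f} GB weights, ~{ratio:.0%} of a dense "
          f"forward pass per token")
```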
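For the tool-calling entries above (the IBM 2B in particular), here is a minimal sketch of an OpenAI-style function-calling request against a local server; the endpoint, model alias and get_weather schema are invented for illustration, not part of any model or of LocalClaw:

```python
# Minimal tool-calling sketch against a local OpenAI-compatible endpoint;
# the model alias and the get_weather tool are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="granite-2b",  # server-side alias (assumption)
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)
# A tool-capable small model should return a structured call rather than
# free text, e.g. get_weather({"city": "Oslo"}).
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)
```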
LocalClaw ranks models using their tags plus relative benchmark scores for speed, quality, coding and reasoning; a simplified sketch of that scoring follows. The goal is a practical local-setup recommendation, not a synthetic leaderboard.
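The sketch below is illustrative only: the weights, normalized scores and tag bonuses are assumptions, not LocalClaw's actual formula.

```python
# Illustrative tag-plus-benchmark ranking; weights, scores and bonuses
# below are assumptions, not LocalClaw's actual formula.
WEIGHTS = {"speed": 0.3, "quality": 0.3, "coding": 0.2, "reasoning": 0.2}

def rank_score(bench: dict[str, float], tags: set[str]) -> float:
    score = sum(w * bench.get(k, 0.0) for k, w in WEIGHTS.items())
    # Tag bonuses bias toward practical local fit, not leaderboard rank.
    if "low-ram" in tags:
        score += 0.05
    if "hybrid-thinking" in tags:
        score += 0.05
    return score

models = {
    "nemotron-nano-4b": ({"speed": 0.95, "quality": 0.70,
                          "coding": 0.65, "reasoning": 0.75},
                         {"low-ram", "hybrid-thinking"}),
    "mid-range-14b":    ({"speed": 0.60, "quality": 0.85,
                          "coding": 0.80, "reasoning": 0.85},
                         set()),
}
for name, (bench, tags) in sorted(models.items(),
                                  key=lambda kv: -rank_score(*kv[1])):
    print(f"{name}: {rank_score(bench, tags):.3f}")
```

Under these made-up weights the 4B wins on speed and tag fit even though the 14B scores higher on raw quality, which matches the practical-first intent described above.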