Use-case guide

Best fast local LLMs for low-latency use in 2026

The best small, fast local LLMs for low-latency chat on laptops, edge machines, and 8–16 GB RAM setups. Ranked from the LocalClaw model database, with RAM requirements, recommended quantization, and links to static model pages.
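The RAM tiers used throughout this list can be sanity-checked with a back-of-the-envelope estimate: weight memory is roughly parameter count times bits per weight divided by 8, plus overhead for the KV cache and runtime. A minimal sketch (the bits-per-weight figures and the flat overhead are assumptions, not exact GGUF numbers):

```python
# Rough RAM estimate for a quantized local model.
# Bits-per-weight values are approximate assumptions; real GGUF
# quants mix bit widths across layers.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}

def est_ram_gb(params_billions: float, quant: str, overhead_gb: float = 1.5) -> float:
    """Weights (params * bits / 8) plus a flat allowance for KV cache and runtime."""
    weights_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return round(weights_gb + overhead_gb, 1)

# A 4B model at Q5_K_M comes out just over 4 GB, consistent
# with the 6 GB RAM tier used for the 4B entries below.
print(est_ram_gb(4, "Q5_K_M"))
```

Longer contexts inflate the KV cache well beyond this flat overhead, which is why the listed tiers leave headroom.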

Matching models: 33
Best pick: Llama-3.1-Nemotron-Nano (4B)
Primary signals: speed, light, edge

Quick answer

For fast / small, start with Llama-3.1-Nemotron-Nano (4B) if your hardware fits it. If not, choose the highest-ranked model that fits your RAM tier and preferred quantization.
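The rule above reduces to a simple selection: walk the ranked list and return the first model whose RAM requirement fits your budget. A minimal sketch, with the model data abridged from the table below:

```python
# (rank, name, min_ram_gb) — abridged from the ranking below.
RANKED = [
    (1, "Llama-3.1-Nemotron-Nano (4B)", 6),
    (2, "Nemotron 3 Nano (4B)", 6),
    (3, "Qwen 3.5 MoE (35B/3B active)", 24),
    (5, "Granite 3.3 (2B Instruct)", 4),
    (15, "Qwen 3.5 (0.8B)", 2),
]

def pick(ram_budget_gb: int) -> str:
    """Highest-ranked model whose RAM requirement fits the budget."""
    for _, name, need in RANKED:
        if need <= ram_budget_gb:
            return name
    raise ValueError("no listed model fits this budget")

print(pick(16))  # Llama-3.1-Nemotron-Nano (4B)
print(pick(4))   # Granite 3.3 (2B Instruct)
```

On a 2 GB device only the 0.8B tier fits, which is exactly why the tiniest Qwen appears in the list at all.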

Top local models for fast / small

Each entry lists parameter count, minimum RAM, recommended quantization, and four scores out of 10: Q (quality), C (coding), R (reasoning), S (speed).

#1

Llama-3.1-Nemotron-Nano (4B)

4B · 6GB RAM · Q5_K_M · Q:7 C:6 R:8 S:10

⭐ Mac Mini M4 16GB top pick! NVIDIA fine-tune of Llama 3.1. Hybrid /think • /no_think mode — deep reasoning on demand, instant chat otherwise. ~80–120 tok/s on Apple Silicon Metal. 128K context. Apache 2.0.

chat · light · speed · reasoning
#2

Nemotron 3 Nano (4B)

4B · 6GB RAM · Q5_K_M · Q:7 C:6 R:7 S:10

⭐ Mac Mini M4 16GB top pick! NVIDIA's hybrid model — distilled from 9B, keeps 95% of its quality. Hybrid attention + SSM layers = ~80–120 tok/s on Apple Silicon. Blazing fast, minimal RAM. NVIDIA Open Model License.

chat · light · speed · reasoning
#3

Qwen 3.5 MoE (35B/3B active)

35B (3B active) · 24GB RAM · Q4_K_M · Q:8 C:9 R:8 S:9

MoE gem — only 3B params active at inference. 19x faster than Qwen3-Max at 256K context. Best quality-per-watt of the series. Hybrid thinking mode. Runs on Mac Studio 32GB. Agentic coding standout.

chat · code · reasoning · power · speed
#4

Qwen 3.6 (6.7B)

6.7B · 8GB RAM · Q4_K_M · Q:7 C:7 R:8 S:9

Alibaba's hybrid-thinking micro-flagship. Toggles between instant answers and deep chain-of-thought reasoning on demand. 128K context, 29 languages, outperforms Qwen3-8B on reasoning benchmarks. Apache 2.0.

chat · code · reasoning · speed · general
#5

Granite 3.3 (2B Instruct)

2B · 4GB RAM · Q5_K_M · Q:6 C:6 R:5 S:10

IBM ultra-efficient 2B. Best-in-class among small models for tool calling & structured output. Perfect for on-device RAG and agents. 128K context. Apache 2.0.

chat · light · edge · speed · code
#6

Ling-2.6-flash (104B MoE)

104B (7.4B active) · 80GB RAM · Q4_K_M · Q:9 C:9 R:8 S:8

InclusionAI's MIT-licensed instruct MoE optimized for fast agent workloads. 104B total parameters, only 7.4B active, hybrid linear attention, 262K context and strong tool-use / multi-step execution with high token efficiency.

chat · code · reasoning · speed · quality
#7

Qwen 3.5 (2B)

2B · 4GB RAM · Q4_K_M · Q:5 C:5 R:4 S:10

Ultra-compact Qwen 3.5 with hybrid thinking mode and 256K context. Runs comfortably on 4 GB RAM — ideal for MacBook Air M1/M2, Windows laptops, and edge devices. Apache 2.0.

chat · code · edge · speed
#8

DeepScaleR (1.5B)

1.5B · 4GB RAM · Q5_K_M · Q:5 C:4 R:8 S:10

Tiny model beating o1-preview on math! Incredible reasoning-to-size ratio. 474K downloads.

reasoning · light · speed
#9

GLM 4.7 Flash

14B · 16GB RAM · Q5_K_M · Q:7 C:7 R:7 S:9

Zhipu AI's fast GLM model. 14B parameters optimized for quick responses with strong bilingual (CN/EN) capabilities. Efficient inference for everyday tasks. Apache 2.0.

chat · code · power · speed
#10

Qwen 3.5 (4B)

4B · 6GB RAM · Q4_K_M · Q:6 C:6 R:6 S:9

Sweet-spot small model. Surprisingly capable for its size, with hybrid thinking, 256K context, and strong multilingual support. Comfortable on 8 GB machines; the go-to for a MacBook Air M4 16 GB. Apache 2.0.

chat · code · reasoning · speed · general
#11

Gemma 4 E2B

E2B · 6GB RAM · Q5_K_M · Q:6 C:5 R:6 S:9

Gemma 4 compact multimodal model for on-device usage. Supports text, image, audio, and video understanding with 256K context. Apache 2.0.

chat · vision · speed · edge · multimodal · general
#12

Qwen 3 (4B)

4B · 4GB RAM · Q5_K_M · Q:6 C:7 R:7 S:9

Alibaba's think-then-answer model. Built-in chain-of-thought reasoning at just 4B params.

chat · code · light · speed · reasoning
#13

Phi-4 Mini Reasoning (3.8B)

3.8B · 6GB RAM · Q5_K_M · Q:6 C:6 R:7 S:9

Balanced reasoning variant of Phi-4 Mini. Good chain-of-thought at a tiny size. 62K downloads.

reasoning · light · speed
#14

Phi-4 Mini (3.8B)

3.8B · 4GB RAM · Q5_K_M · Q:6 C:7 R:6 S:9

Microsoft's latest small miracle. Punches way above its weight in reasoning & code.

chat · code · light · speed
#15

Qwen 3.5 (0.8B)

0.8B · 2GB RAM · Q4_K_M · Q:4 C:4 R:3 S:10

The tiniest Qwen 3.5 — runs on any device including smartphones and Raspberry Pi. Hybrid thinking mode, 256K context, Apache 2.0. Perfect for on-device AI and embedded apps.

chat · code · edge · speed
#16

Phi-3 (3.8B)

3.8B · 6GB RAM · Q5_K_M · Q:6 C:6 R:6 S:9

Microsoft's lightweight powerhouse, still great for edge devices. 11.3M downloads.

chat · reasoning · light · speed
#17

Phi-3.5 (3.8B)

3.8B · 6GB RAM · Q5_K_M · Q:6 C:6 R:6 S:9

Phi-3.5 outperforms many larger models. Excellent small model from Microsoft. 333K downloads.

chat · light · speed
#18

Granite 3.2 Vision (2B)

2B · 4GB RAM · Q5_K_M · Q:5 C:3 R:5 S:10

IBM's vision model for document extraction. Tiny but effective at document understanding. 365K downloads.

vision · light · speed

How this ranking works

LocalClaw ranks models using their tags plus relative benchmark scores for speed, quality, coding, and reasoning. The goal is a practical recommendation for a local setup, not a synthetic leaderboard.
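The blend described above can be sketched as a weighted composite of each model's Q/C/R/S scores. The weights here are illustrative assumptions, not LocalClaw's actual formula; for a speed-focused list like this one, the speed score dominates:

```python
# Illustrative weights (assumptions, not LocalClaw's real formula):
# a speed-focused list weights S heaviest.
WEIGHTS = {"Q": 0.2, "C": 0.1, "R": 0.2, "S": 0.5}

def composite(scores: dict) -> float:
    """Weighted sum of the per-model Q/C/R/S scores, out of 10."""
    return round(sum(WEIGHTS[k] * v for k, v in scores.items()), 2)

nemotron = {"Q": 7, "C": 6, "R": 8, "S": 10}  # the #1 entry's scores
granite  = {"Q": 6, "C": 6, "R": 5, "S": 10}  # the #5 entry's scores

print(composite(nemotron), composite(granite))  # the #1 pick scores higher
```

A quality-focused list would simply shift weight from S to Q and C, reordering the same database.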