RAM tier guide

Best local LLMs for 128GB RAM

A static, Google-indexable guide to the best local AI models that fit in a 128GB RAM budget. Built from the LocalClaw model database and ranked by quality, reasoning, coding and speed.

Compatible models
166
Best pick
Qwen 3 Coder (30B)
RAM tier
128GB
Hardware fit
large-memory workstations and server-grade local AI machines

Quick answer

With 128GB of RAM, prioritize models whose minimum RAM requirement sits well below 128GB rather than filling memory completely; leave headroom for the OS, context cache, and other apps. For most users, start with Qwen 3 Coder (30B), then test a smaller, faster model if latency matters.
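The rule above can be sketched as a simple filter. This is a hypothetical snippet, not the real LocalClaw database schema: the field names, the sample rows, and the 20% headroom factor are illustrative assumptions.

```python
# Hypothetical model table; names and min-RAM figures copied from the list
# in this guide, but the structure is illustrative, not the LocalClaw schema.
MODELS = [
    # (name, min_ram_gb)
    ("Qwen 3 Coder (30B)", 24),
    ("Qwen 3.5 MoE (122B/10B active)", 80),
    ("Command A (111B)", 96),
]

HEADROOM = 0.8  # assumption: keep ~20% of RAM free for OS and context cache

def fits(min_ram_gb: float, budget_gb: float = 128) -> bool:
    """A model is a candidate if its minimum RAM is within the usable budget."""
    return min_ram_gb <= budget_gb * HEADROOM

candidates = [name for name, min_ram in MODELS if fits(min_ram)]
print(candidates)  # all three fit a 128GB budget with headroom
```

With a 128GB budget and 20% headroom, even the 96GB-minimum Command A stays inside the usable ~102GB, which is why such large models appear on this tier's list at all.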

Top models for 128GB RAM

#1

Qwen 3 Coder (30B)

30B · 24GB min · Q4_K_M · 18GB

Qwen's flagship coding model. Designed for agentic coding with a 256K context window. Outperforms Claude 3.5 Sonnet on SWE-bench. Apache 2.0.

code · power · quality
#2

Qwen 3.6 (27B)

27B · 32GB min · Q4_K_M · 17GB

Qwen 3.6 flagship dense model. Hybrid thinking mode with /think toggle for deep chain-of-thought reasoning. 128K context, 29+ languages. Significantly outperforms Qwen3.5-27B on reasoning, coding & math. Apache 2.0.

chat · code · reasoning · power · quality
#3

Gemma 4 26B A4B

26B (A4B active) · 24GB min · Q4_K_M · 16GB

Gemma 4 MoE flagship-for-workstations: 26B total with ~4B active parameters. 256K context and excellent quality-per-watt for local inference. Apache 2.0.

chat · code · reasoning · power · multimodal
#4

Qwen 3 (32B)

32B · 32GB min · Q4_K_M · 20GB

Near GPT-4 intelligence locally. Thinking mode demolishes hard problems. The local AI dream.

chat · code · reasoning · power · quality
#5

Kimi K2.5 (32B/1T MoE)

32B active (1T total MoE) · 32GB min · Q4_K_M · 22GB

Moonshot AI's agentic flagship. 1T total MoE parameters with 32B active per forward pass. Unmatched long-context reasoning at 256K tokens. Designed for complex agentic tasks and tool use. Model License — check moonshotai.com for commercial terms.

chat · code · reasoning · power · quality
#6

Qwen 3.5 MoE (122B/10B active)

122B (10B active) · 80GB min · Q4_K_M · 65GB

Large MoE model with only 10B active params. 60% cheaper to run than Qwen3-Max. 256K context. Top-tier reasoning, coding and multilingual. Hybrid think/non-think. Apache 2.0.

chat · code · reasoning · quality · power
#7

Qwen 3 Next (80B/3B MoE)

80B (3B active) · 64GB min · Q4_K_M · 48GB

Alibaba's next-gen MoE with hybrid-gated DeltaNet attention. Only 3B active params — runs at dense 7B speed with 70B quality. 256K native context (extensible to 1M). Hybrid thinking mode. Apache 2.0.

chat · code · reasoning · power · quality
#8

Gemma 4 31B

31B · 32GB min · Q4_K_M · 19GB

Largest Gemma 4 model for premium local quality. Strong coding and reasoning with 256K context and broad multilingual support. Apache 2.0.

chat · code · reasoning · quality · multimodal
#9

DeepSeek V3.2 (37B/671B MoE)

37B (671B MoE) · 48GB min · Q4_K_M · 40GB

DeepSeek's massive MoE flagship. 37B active out of 671B total. Exceptional coding, reasoning and general capabilities. Ranks #6 on global usage leaderboards with 29B monthly tokens. MIT licensed.

chat · code · reasoning · power · quality
#10

Trinity Large Preview (70B MoE)

70B (MoE, ~400B total) · 48GB min · Q4_K_M · 45GB

Arcee AI's massive MoE open model. ~400B total parameters, 70B active per forward pass. Ranks near the top of global usage leaderboards. Exceptional versatility across reasoning, coding and chat. Free and open-source. Apache 2.0.

chat · code · reasoning · power · quality
#11

Qwen 3.5 (27B)

27B · 32GB min · Q4_K_M · 17GB

Dense 27B powerhouse. Hybrid thinking/non-thinking mode. Strong multilingual (29+ languages). 256K context window. Excellent instruction-following and math. Apache 2.0.

chat · code · reasoning · power · general
#12

Qwen 3.5 MoE (35B/3B active)

35B (3B active) · 24GB min · Q4_K_M · 20GB

MoE gem — only 3B params active at inference. 19x faster than Qwen3-Max at 256K context. Best quality-per-watt of the series. Hybrid thinking mode. Runs on Mac Studio 32GB. Agentic coding standout.

chat · code · reasoning · power · speed
#13

Llama-3.3-Nemotron-Super (49B)

49B · 40GB min · Q4_K_M · 30GB

NVIDIA's super-efficient 49B distilled from DeepSeek-R1 + Llama. Outperforms Llama-3.3-70B at half the compute. Strong reasoning, coding & instruction following. Runs on Mac Studio 64GB. NVIDIA Open Model License.

chat · reasoning · code · power · quality
#14

GLM 4.5 Air (MoE)

106B (14B active, MoE) · 16GB min · Q4_K_M · 9GB

Zhipu AI's efficient MoE powerhouse. 106B total parameters, only 14B active at inference — dense-model speed with much larger model quality. Clearly the best in the 16–24GB RAM range. Outperforms Llama 3.3 70B. Apache 2.0.

chat · code · power · quality · general
#15

Command A (111B)

111B · 96GB min · Q4_K_M · 68GB

Cohere's open-weight flagship, optimized for agentic workflows and long-context RAG. 256K context, excellent multilingual coverage (23 languages). CC-BY-NC 4.0 — non-commercial.

chat · reasoning · quality · general · power
#16

MiroThinker v1.5 (30B MoE)

30B (3B active, MoE) · 48GB min · Q4_K_M · 18GB

⚠️ Despite the small active count, this is a full 30B MoE model (Qwen3-30B-A3B base). ~82 GB full weights (Q4_K_M ≈18 GB). Deep-research agent with 256K context, tool calls, multilingual (EN/ZH). Requires H100 80 GB or serious multi-GPU. Not suitable for M1/M2 or consumer GPUs. Apache 2.0.

reasoning · code · power · quality
#17

MiroThinker 1.7 (30B MoE)

30B (3B active, MoE) · 48GB min · Q4_K_M · 18GB

MiroMind AI's second-gen deep-research agent. 30B MoE with stronger tool use, 256K context, and SOTA on BrowseComp-ZH (Chinese research). Designed for agentic workflows, not casual chat. Released March 2026. Apache 2.0.

reasoning · code · power · quality
#18

MiroThinker 1.7 Mini (30B MoE)

30B (3B active, MoE) · 48GB min · Q4_K_M · 18GB

⚠️ Despite the "Mini" name, this is a full 30B MoE model (Qwen3-30B-A3B). 3B = active params per forward pass, NOT model size. ~82 GB full weights. Requires H100 80GB or multi-GPU. 256K context, multilingual (EN/ZH+), deep-research agent with tool calls. Released 11 Mar 2026. Apache 2.0.

reasoning · code · power · quality
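The per-model quantized sizes listed above line up with a simple bits-per-weight estimate. The ~4.85 bits/weight figure commonly quoted for Q4_K_M (mixed 4- and 6-bit blocks) is an approximation, not an official spec, so treat this as a back-of-envelope check only:

```python
def q4_k_m_size_gb(total_params_billions: float) -> float:
    """Rough quantized-weight size in decimal GB for a Q4_K_M GGUF.

    4.85 bits/weight is an assumed effective rate for Q4_K_M; real files
    vary by architecture and embedding/output layer quantization.
    """
    bits_per_weight = 4.85
    return total_params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(round(q4_k_m_size_gb(30), 1))  # -> 18.2, close to the 18GB listed for 30B models
```

Note this covers weights only: the KV cache for long contexts (256K for several models here) and runtime overhead come on top, which is why the listed minimum-RAM figures exceed the quantized file sizes.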

How to choose at 128GB