The Mac Studio M4 Ultra with 128GB of unified memory is a serious machine for local inference of large models. This page lists local AI models that fit its memory budget, with realistic performance expectations for LM Studio and similar runtimes.
For the Mac Studio M4 Ultra 128GB, start with Qwen 3 MoE (235B/22B active). Models marked “Comfortable” leave useful memory headroom; “Tight but possible” can work, but close other apps and prefer a lower-bit quantization.
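Under the hood, “fits” is mostly arithmetic: quantized weight size, plus KV cache, plus runtime overhead, measured against the 128GB budget. Below is a minimal Python sketch of that estimate. The bits-per-weight figures, the 4GB overhead allowance, the layer/head counts used in the 235B example, and the 75%/95% cutoffs are all illustrative assumptions, not LocalClaw's official “Comfortable”/“Tight” criteria or measured numbers.

```python
# Rough "does it fit?" estimate for a quantized model on 128GB unified memory.
# Everything numeric here is an illustrative assumption: ~4.5 bits/weight
# approximates a Q4_K_M-style GGUF, and the 4 GB overhead and 75%/95%
# cutoffs are arbitrary, not LocalClaw's official criteria.

def model_footprint_gb(total_params_b: float, bits_per_weight: float,
                       overhead_gb: float = 4.0) -> float:
    """Quantized weight size plus a flat runtime-overhead allowance."""
    return total_params_b * bits_per_weight / 8 + overhead_gb  # B params -> GB

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """FP16 KV cache: 2 tensors (K and V) per layer, per token."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9

def fit_verdict(total_gb: float, budget_gb: float = 128.0) -> str:
    if total_gb <= budget_gb * 0.75:
        return "Comfortable"
    if total_gb <= budget_gb * 0.95:
        return "Tight but possible"
    return "Does not fit"

# Example: a 235B-total MoE with a 32K-token context. The layer/head
# numbers are assumed for illustration, not the model's real architecture.
cache = kv_cache_gb(layers=94, kv_heads=4, head_dim=128, context_tokens=32_768)
for bpw in (4.5, 3.0):
    total = model_footprint_gb(235, bits_per_weight=bpw) + cache
    print(f"{bpw} bpw: ~{total:.0f} GB -> {fit_verdict(total)}")
```

Run as-is, this prints roughly 142 GB (does not fit) at 4.5 bits per weight and roughly 98 GB (tight) at 3.0, which is exactly why the recommendation above says to prefer a lower-bit quantization for the biggest models.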
Qwen 3 MoE (235B/22B active). Mixture-of-experts behemoth: only 22B parameters are active per token, so it decodes fast despite its massive total size. Top-tier (the total-vs-active math is sketched after this list).
Mistral's flagship. 128K context. Top-tier coding and multilingual performance. 262K downloads. Requires serious hardware.
Cohere's enterprise flagship, top-tier for RAG and business use. 58K downloads.
Cohere's open-weight flagship, optimised for agentic workflows and long-context RAG. 256K context, excellent multilingual coverage (23 languages). CC-BY-NC 4.0 (non-commercial).
Arcee AI's massive MoE open model. ~400B total parameters, 70B active per forward pass. Ranks near the top of global usage leaderboards. Exceptional versatility across reasoning, coding and chat. Free and open-source. Apache 2.0.
DeepSeek's massive MoE flagship. 37B active out of 671B total. Exceptional coding, reasoning and general capabilities. Ranks #6 on global usage leaderboards with 29B monthly tokens. MIT licensed.
Meta Llama 4 Scout: natively multimodal MoE with 16 experts. 10M-token context window. Outperforms Gemma 3 and Mistral Small on most benchmarks at a similar active-parameter cost. Llama 4 Community License.
Large MoE model with only 10B active params. 60% cheaper to run than Qwen3-Max. 256K context. Top-tier reasoning, coding and multilingual performance. Hybrid thinking/non-thinking modes. Apache 2.0.
Alibaba's next-gen MoE with hybrid-gated DeltaNet attention. Only 3B active params — runs at dense 7B speed with 70B quality. 256K native context (extensible to 1M). Hybrid thinking mode. Apache 2.0.
Qwen 3.6 flagship dense model. Hybrid thinking mode with /think toggle for deep chain-of-thought reasoning. 128K context, 29+ languages. Significantly outperforms Qwen3.5-27B on reasoning, coding & math. Apache 2.0.
Qwen flagship coding model. Designed for agentic coding with 256K context. Outperforms Claude 3.5 Sonnet on SWE-bench. Apache 2.0.
Near-GPT-4 intelligence running locally. Thinking mode demolishes hard problems. The local-AI dream.
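Several entries above quote two parameter counts (235B total / 22B active, 671B/37B, ~400B/70B). The split matters on unified memory: every expert must stay resident, so the footprint scales with total parameters, while decode speed is roughly bounded by how many bytes of weights are read per token, i.e. the active parameters. A rough sketch follows, assuming a memory-bandwidth-bound decode; the ~800 GB/s bandwidth, the 4.5 bits/weight, and the decode_tps_ceiling helper are all assumptions for illustration, not measured M4 Ultra figures.

```python
# Why MoE cards list two numbers: memory footprint scales with TOTAL
# parameters (every expert must be resident), while decode speed is
# roughly bounded by the weights actually read per token (ACTIVE params).
# The 800 GB/s bandwidth and 4.5 bits/weight are assumed figures, and the
# ceiling ignores KV-cache reads and compute, so treat it as a ballpark.

def decode_tps_ceiling(active_params_b: float, bits_per_weight: float = 4.5,
                       mem_bw_gbps: float = 800.0) -> float:
    """Upper bound on tokens/sec when decode is memory-bandwidth bound."""
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return mem_bw_gbps / gb_read_per_token

for name, total_b, active_b in [
    ("235B/22B-active MoE", 235, 22),
    ("671B/37B-active MoE", 671, 37),   # needs far heavier quantization to fit 128GB
]:
    mem_gb = total_b * 4.5 / 8          # footprint set by TOTAL params
    tps = decode_tps_ceiling(active_b)  # speed set by ACTIVE params
    print(f"{name}: ~{mem_gb:.0f} GB weights, decode ceiling ~{tps:.0f} tok/s")
```

That asymmetry is why a 235B-total MoE can decode at something like dense-22B speed, as the cards above claim.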
This page is about local AI fit, not a live price tracker. Prices and availability change. If an Amazon link is present, it may be an affiliate link that supports LocalClaw at no extra cost.