Llama 4 Maverick (17B/400B MoE)
Meta Llama 4 Maverick — 128-expert MoE flagship. Matches or beats GPT-4o and Gemini 2.0 Flash on reasoning, coding and multimodal benchmarks. 1M-token context. Server-grade hardware only. Llama 4 Community License.
Best local multimodal and vision-language models for image understanding, OCR, document analysis and visual reasoning. Ranked from the LocalClaw model database with RAM requirements, quantization and links to static model pages.
For vision, start with Llama 4 Maverick (17B/400B MoE) if it fits your hardware. If not, choose the highest-ranked model that fits your RAM tier and preferred quantization.
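A rough way to apply that advice: a quantized model's weights take approximately total parameters × bits-per-weight ÷ 8 in RAM, plus overhead for the KV cache and runtime. This is a common rule of thumb, not LocalClaw's published sizing formula, and the model list below is illustrative only. A minimal Python sketch:

```python
# Minimal sketch: pick the highest-ranked model that fits a RAM budget.
# The model list, quantization table, and 20% overhead factor are
# illustrative assumptions, not LocalClaw's database or exact formula.

QUANT_BITS = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8}

# (rank, name, total params in billions). Note: total params, not active
# params, set the weight footprint -- this matters for MoE models.
MODELS = [
    (1, "Llama 4 Maverick", 400),
    (2, "Llama 4 Scout", 109),
    (3, "Gemma 4 27B", 27),       # hypothetical list entry
    (4, "Qwen3-VL 32B", 32),      # hypothetical list entry
]

def est_ram_gb(params_b: float, quant: str, overhead: float = 1.2) -> float:
    """Rough footprint: params * bits / 8, plus ~20% for KV cache/runtime."""
    return params_b * QUANT_BITS[quant] / 8 * overhead

def best_fit(ram_gb: float, quant: str) -> str | None:
    """Return the highest-ranked model whose estimated footprint fits."""
    for _, name, params_b in sorted(MODELS):
        if est_ram_gb(params_b, quant) <= ram_gb:
            return name
    return None

print(best_fit(32, "Q4_K_M"))  # -> "Gemma 4 27B" on a 32 GB machine
```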
Gemma 4 MoE flagship for workstations: 26B total parameters with ~4B active. 256K context and excellent quality-per-watt for local inference. Apache 2.0.
Largest Gemma 4 model for premium local quality. Strong coding and reasoning with 256K context and broad multilingual support. Apache 2.0.
Meta Llama 4 Scout — natively multimodal MoE with 16 experts. 10M-token context window. Outperforms Gemma 3 and Mistral Small on most benchmarks at similar active-parameter cost. Llama 4 Community License.
Qwen 3 VL flagship open vision model. Competes with GPT-4o on MMMU, chart-QA and document reasoning. Native video understanding up to 1 hour. Apache 2.0.
Meta's largest open MoE. 17B active params across 128 experts (~400B total). Multimodal with exceptional image reasoning. Server-grade hardware required (see the sizing sketch after this list). Llama 4 Community License.
Qwen 3 vision-language model. Strong OCR, document understanding, chart & UI reasoning. 128K context with native image+video inputs. Apache 2.0.
Qwen's massive vision-language model. Exceptional image and video understanding at 72B scale. 128K context.
Google's flagship multimodal model. Image + text understanding at an exceptional level.
Meta's largest vision model. 128K context with powerful image reasoning and analysis. Requires significant hardware.
Meta's multimodal MoE model. 17B active params across 16 experts (~109B total). Built-in image understanding. 10M-token context window. Llama 4 Community License. 728K downloads.
Google's 12B multimodal beast. Understands images natively. Excellent quality for 16GB machines.
Gemma 4 balanced edge model with strong multimodal quality and 256K context. Great for laptops and high-end mobile devices. Apache 2.0.
Google's on-device powerhouse with vision. Designed for phones, tablets and laptops but punches far above its weight. Per-layer memory management for constrained devices. Apache 2.0.
Shanghai AI Lab multimodal model. Strong vision understanding for documents, charts, and photos. MIT licensed. Note: ships primarily as PyTorch/safetensors; community GGUF availability may vary.
Mistral Small + vision + 128K context. See and understand images. 311K downloads.
Qwen flagship vision-language model. Excellent at image understanding. 732K downloads.
Gemma 4 compact multimodal model for on-device usage. Supports text, image, audio, and video understanding with 256K context. Apache 2.0.
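The MoE entries above are worth a quick sizing check: per-token compute tracks the 17B active parameters, but all expert weights must stay resident, so memory tracks the ~400B (Maverick) and ~109B (Scout) totals. A back-of-the-envelope sketch, reusing the assumed ~4.8 bits-per-weight Q4 estimate from the sketch above:

```python
# Back-of-the-envelope MoE sizing. The ~4.8 bits/weight Q4 estimate is the
# same assumption as in the earlier sketch; totals are the figures quoted
# in the list entries.

def q4_weights_gb(total_params_b: float, bits_per_weight: float = 4.8) -> float:
    return total_params_b * bits_per_weight / 8

for name, total_b, active_b in [
    ("Llama 4 Maverick", 400, 17),
    ("Llama 4 Scout", 109, 17),
]:
    print(f"{name}: ~{q4_weights_gb(total_b):.0f} GB of Q4 weights resident, "
          f"{active_b}B params active per token")

# Llama 4 Maverick: ~240 GB of Q4 weights resident, 17B params active per token
# Llama 4 Scout: ~65 GB of Q4 weights resident, 17B params active per token
```

That is why Maverick stays server-grade even though its per-token compute resembles a 17B dense model.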
LocalClaw ranks models using their tags plus relative benchmark scores for speed, quality, coding and reasoning. The goal is a practical local setup recommendation, not a synthetic leaderboard.
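The exact weighting isn't published here, so the following is only a hedged sketch of what combining tags with relative benchmark scores might look like; every weight, tag boost, and score below is an illustrative assumption, not LocalClaw's real formula.

```python
# Hypothetical tag-plus-benchmark ranking score. Weights, tag bonuses, and
# the example scores are illustrative assumptions only.

WEIGHTS = {"speed": 0.2, "quality": 0.4, "coding": 0.2, "reasoning": 0.2}
TAG_BOOST = {"vision": 0.10, "apache-2.0": 0.05}  # hypothetical tag bonuses

def score(benchmarks: dict[str, float], tags: list[str]) -> float:
    """Weighted sum of normalized (0-1) benchmark scores plus tag bonuses."""
    base = sum(WEIGHTS[k] * benchmarks.get(k, 0.0) for k in WEIGHTS)
    return base + sum(TAG_BOOST.get(t, 0.0) for t in tags)

print(score({"speed": 0.6, "quality": 0.9, "coding": 0.8, "reasoning": 0.85},
            ["vision", "apache-2.0"]))  # -> 0.96
```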