Local LLM model page

ZAYA1-8B

Zyphra's Apache-2.0 reasoning MoE: 8.4B total parameters with only ~760M active per token, 16 experts, a 131K context window, Compressed Convolutional Attention, and strong math/code benchmarks. Experimental for local use today: it currently requires Zyphra's vLLM/Transformers forks, and LM Studio/GGUF/MLX support is not yet verified.

Parameters: 8.4B (760M active, MoE)

Minimum RAM: 24 GB

Model size: 17 GB

Quantization: BF16 (Zyphra fork)

Can ZAYA1-8B run locally?

ZAYA1-8B is best suited to power-user machines with 32 GB of RAM. LocalClaw recommends the BF16 weights from the Zyphra fork as the default format, with 24 GB of RAM as a practical floor.
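
A rough back-of-the-envelope check of how 17 GB of BF16 weights turn into a 24 GB floor (the overhead allowance is an assumption, not a measurement):

    # BF16 memory estimate for ZAYA1-8B. Note that MoE inference keeps ALL
    # experts in memory even though only ~760M parameters are active per token.
    total_params = 8.4e9        # total parameters
    bytes_per_param = 2         # BF16 = 2 bytes per parameter
    weights_gb = total_params * bytes_per_param / 1e9
    print(f"weights: ~{weights_gb:.1f} GB")   # ~16.8 GB, matching the 17 GB model size

    overhead_gb = 5             # KV cache + runtime overhead; assumption, grows with context
    print(f"practical floor: ~{weights_gb + overhead_gb:.0f} GB")  # ~22 GB, hence 24 GB minimum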

Search term for LM Studio or compatible runtimes (once support lands): Zyphra/ZAYA1-8B

Hugging Face repository: Zyphra/ZAYA1-8B
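
A minimal loading sketch for the Transformers route, assuming Zyphra's fork is installed and that the standard AutoModel entry points apply; the prompt and generation settings are placeholders:

    # Assumes Zyphra's Transformers fork is installed; stock Transformers
    # does not yet support this architecture (see Limitations below).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = "Zyphra/ZAYA1-8B"
    tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        repo,
        torch_dtype=torch.bfloat16,   # matches the published BF16 weights
        device_map="auto",            # spread across available GPU/CPU memory
        trust_remote_code=True,
    )

    prompt = "Prove that the square root of 2 is irrational."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512)
    print(tokenizer.decode(out[0], skip_special_tokens=True))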

Tags: chat, code, reasoning, math, experimental

Strengths

  • Very high intelligence density: 8.4B total with ~760M active parameters
  • Strong mathematics, coding and long-form reasoning benchmarks
  • 131K context window
  • Apache 2.0 license
  • Designed for test-time-compute workflows such as Markovian RSA (see the sketch after this list)
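
On the last point: our reading is that "Markovian RSA" refers to recursive self-aggregation-style test-time compute, where each round aggregates only the current pool of candidate answers (the Markov property) rather than the full history. The sketch below is a generic illustration under that reading, not Zyphra's implementation; ask_model is a hypothetical stub you would back with the loaded model:

    import random

    def ask_model(prompt: str) -> str:
        # Hypothetical stub -- back this with model.generate(...) from the
        # loading sketch above, or with any local runtime.
        return "stub answer for: " + prompt.splitlines()[0]

    def rsa(problem: str, population: int = 8, group: int = 3, rounds: int = 4) -> str:
        # Round 0: sample independent candidate solutions.
        candidates = [ask_model(f"Solve step by step: {problem}") for _ in range(population)]
        for _ in range(rounds):
            # Each new candidate aggregates a random subset of the CURRENT
            # population only -- no dependence on earlier rounds (Markovian).
            candidates = [
                ask_model(
                    f"Problem: {problem}\n"
                    + "Candidate solutions:\n" + "\n---\n".join(random.sample(candidates, group))
                    + "\nMerge their correct ideas into one improved solution."
                )
                for _ in range(population)
            ]
        return candidates[0]  # in practice: pick by majority vote or a verifier

    print(rsa("What is 17 * 24?"))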

Limitations

  • Experimental local runtime support today
  • Currently documented only with Zyphra's forks of vLLM and Transformers (see the serving sketch after this list)
  • No verified LM Studio, Ollama, llama.cpp, GGUF or MLX support yet
  • BF16 weights (~17 GB) do not fit in the unified memory of a 16 GB Mac mini M4
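
For the server/lab route, a minimal sketch assuming Zyphra's vLLM fork preserves upstream vLLM's Python API; the sampling settings are placeholders:

    # Requires Zyphra's vLLM fork; stock vLLM does not yet ship this architecture.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Zyphra/ZAYA1-8B", trust_remote_code=True, dtype="bfloat16")
    params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
    outputs = llm.generate(["Write a Python function that checks whether a number is prime."], params)
    print(outputs[0].outputs[0].text)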

Best use cases

  • Mathematical reasoning research
  • Coding and algorithmic problem solving
  • Reasoning benchmark experimentation
  • Server/local lab evaluation with Zyphra runtime forks
  • Future compact on-device MoE experiments once runtimes catch up

Benchmarks

Speed: 7/10

Quality: 8/10

Coding: 8/10

Reasoning: 9/10

Technical details

Developer: Zyphra

License: Apache 2.0

Context window: 131,072 tokens

Architecture: Sparse MoE with Compressed Convolutional Attention (CCA), 16 experts, top-1 MLP router and learned residual scaling
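
To unpack the routing terminology, here is a toy PyTorch layer with 16 experts, top-1 routing, and a learned residual scale; the dimensions are invented for illustration, and CCA itself is not modeled:

    # Toy top-1 MoE MLP layer: 16 experts, one active per token, plus a
    # learned residual scaling. Dimensions are illustrative, not ZAYA1's.
    import torch
    import torch.nn as nn

    class Top1MoE(nn.Module):
        def __init__(self, d_model: int = 512, d_ff: int = 1024, n_experts: int = 16):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )
            # Learned residual scaling: how much expert output to mix back in.
            self.residual_scale = nn.Parameter(torch.ones(d_model))

        def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
            logits = self.router(x)              # (tokens, n_experts)
            probs = logits.softmax(dim=-1)
            top1 = logits.argmax(dim=-1)         # one expert chosen per token
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = top1 == i
                if mask.any():
                    # Weight by router probability so routing stays differentiable.
                    out[mask] = expert(x[mask]) * probs[mask, i].unsqueeze(-1)
            return x + self.residual_scale * out  # scaled residual connection

    x = torch.randn(4, 512)
    print(Top1MoE()(x).shape)   # torch.Size([4, 512])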

Released: 2026-05