Local LLM model page

ZAYA1-8B

Zyphra's Apache-2.0 reasoning MoE: 8.4B total parameters with only ~760M active per token, 16 experts, a 131K context window, Compressed Convolutional Attention, and strong math/code benchmarks. Experimental for local use today: it currently requires Zyphra's vLLM/Transformers forks, and LM Studio/GGUF/MLX support is not yet verified.

Parameters: 8.4B (760M active, MoE)

Minimum RAM: 24 GB

Model size: 17 GB

Quantization: BF16 (Zyphra fork)

Can ZAYA1-8B run locally?

ZAYA1-8B is best suited to power-user machines with 32 GB of RAM. LocalClaw recommends the BF16 weights from the Zyphra fork as the default format, with 24 GB of RAM as a practical floor.
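
A rough back-of-the-envelope check of how 17 GB of BF16 weights turn into a 24 GB floor (the overhead allowance is an assumption, not a measurement):

    # BF16 memory estimate for ZAYA1-8B. Note that MoE inference keeps ALL
    # experts in memory even though only ~760M parameters are active per token.
    total_params = 8.4e9        # total parameters
    bytes_per_param = 2         # BF16 = 2 bytes per parameter
    weights_gb = total_params * bytes_per_param / 1e9
    print(f"weights: ~{weights_gb:.1f} GB")   # ~16.8 GB, matching the 17 GB model size

    overhead_gb = 5             # KV cache + runtime overhead; assumption, grows with context
    print(f"practical floor: ~{weights_gb + overhead_gb:.0f} GB")  # ~22 GB, hence 24 GB minimum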

Search term for LM Studio or compatible runtimes (once support lands): Zyphra/ZAYA1-8B

Hugging Face repository: Zyphra/ZAYA1-8B
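
A minimal loading sketch for the Transformers route, assuming Zyphra's fork is installed and that the standard AutoModel entry points apply; the prompt and generation settings are placeholders:

    # Assumes Zyphra's Transformers fork is installed; stock Transformers
    # does not yet support this architecture (see Limitations below).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = "Zyphra/ZAYA1-8B"
    tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        repo,
        torch_dtype=torch.bfloat16,   # matches the published BF16 weights
        device_map="auto",            # spread across available GPU/CPU memory
        trust_remote_code=True,
    )

    prompt = "Prove that the square root of 2 is irrational."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512)
    print(tokenizer.decode(out[0], skip_special_tokens=True))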

Tags: chat, code, reasoning, math, experimental

Strengths

  • Very high intelligence density: 8.4B total with ~760M active parameters
  • Strong mathematics, coding and long-form reasoning benchmarks
  • 131K context window
  • Apache 2.0 license
  • Designed for test-time-compute workflows such as Markovian RSA (see the sketch after this list)
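
On the last point: our reading is that "Markovian RSA" refers to recursive self-aggregation-style test-time compute, where each round aggregates only the current pool of candidate answers (the Markov property) rather than the full history. The sketch below is a generic illustration under that reading, not Zyphra's implementation; ask_model is a hypothetical stub you would back with the loaded model:

    import random

    def ask_model(prompt: str) -> str:
        # Hypothetical stub -- back this with model.generate(...) from the
        # loading sketch above, or with any local runtime.
        return "stub answer for: " + prompt.splitlines()[0]

    def rsa(problem: str, population: int = 8, group: int = 3, rounds: int = 4) -> str:
        # Round 0: sample independent candidate solutions.
        candidates = [ask_model(f"Solve step by step: {problem}") for _ in range(population)]
        for _ in range(rounds):
            # Each new candidate aggregates a random subset of the CURRENT
            # population only -- no dependence on earlier rounds (Markovian).
            candidates = [
                ask_model(
                    f"Problem: {problem}\n"
                    + "Candidate solutions:\n" + "\n---\n".join(random.sample(candidates, group))
                    + "\nMerge their correct ideas into one improved solution."
                )
                for _ in range(population)
            ]
        return candidates[0]  # in practice: pick by majority vote or a verifier

    print(rsa("What is 17 * 24?"))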

Limitations

  • Experimental local runtime support today
  • Currently documented only with Zyphra's forks of vLLM and Transformers (see the serving sketch after this list)
  • No verified LM Studio, Ollama, llama.cpp, GGUF or MLX support yet
  • BF16 weights (~17 GB) do not fit in the unified memory of a 16 GB Mac mini M4
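
For the server/lab route, a minimal sketch assuming Zyphra's vLLM fork preserves upstream vLLM's Python API; the sampling settings are placeholders:

    # Requires Zyphra's vLLM fork; stock vLLM does not yet ship this architecture.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Zyphra/ZAYA1-8B", trust_remote_code=True, dtype="bfloat16")
    params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
    outputs = llm.generate(["Write a Python function that checks whether a number is prime."], params)
    print(outputs[0].outputs[0].text)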

Best use cases

  • Mathematical reasoning research
  • Coding and algorithmic problem solving
  • Reasoning benchmark experimentation
  • Server/local lab evaluation with Zyphra runtime forks
  • Future compact on-device MoE experiments once runtimes catch up

Benchmarks

Speed: 7/10

Quality: 8/10

Coding: 8/10

Reasoning: 9/10

Technical details

Developer: Zyphra

License: Apache 2.0

Context window: 131,072 tokens

Architecture: Sparse MoE with Compressed Convolutional Attention (CCA), 16 experts, top-1 MLP router and learned residual scaling
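
To unpack the routing terminology, here is a toy PyTorch layer with 16 experts, top-1 routing, and a learned residual scale; the dimensions are invented for illustration, and CCA itself is not modeled:

    # Toy top-1 MoE MLP layer: 16 experts, one active per token, plus a
    # learned residual scaling. Dimensions are illustrative, not ZAYA1's.
    import torch
    import torch.nn as nn

    class Top1MoE(nn.Module):
        def __init__(self, d_model: int = 512, d_ff: int = 1024, n_experts: int = 16):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )
            # Learned residual scaling: how much expert output to mix back in.
            self.residual_scale = nn.Parameter(torch.ones(d_model))

        def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
            logits = self.router(x)              # (tokens, n_experts)
            probs = logits.softmax(dim=-1)
            top1 = logits.argmax(dim=-1)         # one expert chosen per token
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = top1 == i
                if mask.any():
                    # Weight by router probability so routing stays differentiable.
                    out[mask] = expert(x[mask]) * probs[mask, i].unsqueeze(-1)
            return x + self.residual_scale * out  # scaled residual connection

    x = torch.randn(4, 512)
    print(Top1MoE()(x).shape)   # torch.Size([4, 512])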

Released: 2026-05