Local LLM model page

Ling-2.6-flash (104B MoE)

InclusionAI's MIT-licensed instruct MoE optimized for fast agent workloads: 104B total parameters with only 7.4B active, hybrid linear attention, a 262K-token context window, and strong tool-use / multi-step execution with high token efficiency.

Parameters: 104B (7.4B active)
Minimum RAM: 80 GB
Model size: 65 GB
Quantization: Q4_K_M

Can Ling-2.6-flash (104B MoE) run locally?

Ling-2.6-flash (104B MoE) is best suited for large-memory workstations. LocalClaw recommends Q4_K_M as the default quantization, with at least 80 GB RAM.
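
As a rough sanity check on the numbers above, here is a back-of-the-envelope sizing sketch. The ~4.8 bits-per-weight figure for Q4_K_M is an approximation (real GGUF files vary with the tensor mix), not an official number.

```python
# Back-of-the-envelope size estimate for a quantized MoE model.
# Assumption: Q4_K_M averages roughly 4.8 bits per weight (approximate, varies per file).

def quantized_size_gb(total_params: float, bits_per_weight: float = 4.8) -> float:
    """Approximate on-disk size in gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9

params = 104e9                                 # all 104B parameters are stored, even if only 7.4B are active
model_gb = quantized_size_gb(params)
print(f"~{model_gb:.0f} GB model file")        # ~62 GB, in the ballpark of the listed 65 GB

# Leave headroom for KV cache, runtime buffers and the OS; long contexts need more.
print(f"~{model_gb * 1.25:.0f} GB RAM floor")  # ~78 GB, close to the 80 GB guidance
```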

Search term for LM Studio or compatible runtimes: Ling-2.6-flash

Hugging Face repository: inclusionAI/Ling-2.6-flash
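
If you prefer to fetch the weights programmatically, a minimal sketch using huggingface_hub follows; the local directory is an arbitrary example, and note that the full-precision repository is considerably larger than the 65 GB quantized figure listed above.

```python
# Fetch the full model repository from Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="inclusionAI/Ling-2.6-flash",
    local_dir="./ling-2.6-flash",   # example path; pick your own
)
print("weights stored under:", local_dir)
```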

Tags: chat, code, reasoning, speed, quality

Strengths

  • ⚡ Very efficient MoE: 104B total but only 7.4B active parameters
  • 262K context window for large prompts, repositories and agent traces
  • Hybrid linear attention improves prefill/decode throughput on long workloads
  • Strong tool-use and agent execution focus: BFCL-V4, TAU2-bench, SWE-bench Verified, Claw-Eval and PinchBench are highlighted by InclusionAI
  • Designed to reduce unnecessary reasoning tokens for high-frequency production agent tasks
  • MIT license — permissive open-source usage

Limitations

  • Large 104B MoE files — server-grade hardware is still recommended despite only 7.4B active parameters
  • Best support currently targets SGLang/vLLM with trust_remote_code rather than simple consumer one-click runtimes (see the vLLM sketch after this list)
  • Official notes mention possible tool hallucinations in highly complex scenarios
  • Natural Chinese/English switching and highly complex instruction following still have room for improvement
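
A minimal vLLM offline-inference sketch follows. It assumes the Hugging Face repository loads through vLLM's standard LLM entry point; the tensor-parallel size, context length, prompt and sampling settings are illustrative, not recommendations.

```python
# Minimal vLLM offline-inference sketch.
# trust_remote_code=True lets vLLM load the model's custom modeling code.
from vllm import LLM, SamplingParams

llm = LLM(
    model="inclusionAI/Ling-2.6-flash",
    trust_remote_code=True,
    tensor_parallel_size=4,    # example only: shard across 4 GPUs, adjust to your hardware
    max_model_len=32768,       # keep well below the 262K maximum unless you have the memory for it
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Summarize what a mixture-of-experts router does."], params)
print(outputs[0].outputs[0].text)
```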

Best use cases

  • Agentic coding and codebase automation
  • Tool-calling assistants and multi-step workflow execution (a request sketch follows this list)
  • Long-context analysis up to 262K tokens
  • Production agents where token efficiency and latency matter
  • Open-source research comparing efficient MoE architectures
  • OpenClaw / Claude Code / Kilo Code style agent workflows
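
To make the tool-calling use case concrete, below is a minimal request sketch against an OpenAI-compatible endpoint (for example one served locally by vLLM or SGLang). The base URL, API key and get_weather tool are hypothetical placeholders.

```python
# Minimal tool-calling request against a local OpenAI-compatible server.
# Assumes a server exposing this model was launched separately.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # hypothetical local endpoint

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                    # hypothetical example tool
        "description": "Return current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="inclusionAI/Ling-2.6-flash",
    messages=[{"role": "user", "content": "What's the weather in Hangzhou right now?"}],
    tools=tools,
)

# If the model decides to call the tool, the structured call appears here.
print(resp.choices[0].message.tool_calls)
```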

Benchmarks

Speed: 8/10

Quality: 9/10

Coding: 9/10

Reasoning: 8/10

Technical details

Developer: InclusionAI / Ant Group

License: MIT

Context window: 262,144 tokens

Architecture: Sparse MoE instruct model — 104B total parameters, 7.4B active. Hybrid linear attention using a 1:7 MLA + Lightning Linear architecture, optimized for high-throughput agent inference.
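
Purely as a toy illustration of the hybrid layout (not the actual implementation), the sketch below interleaves one full softmax-attention layer with seven linear-attention layers, assuming the 1:7 ratio means one MLA-style softmax layer for every seven linear layers.

```python
# Toy 1:7 hybrid attention stack (illustrative only, single head, q = k = v = x).
# Assumption: "1:7" means one softmax-attention layer per seven linear-attention layers.
import torch
import torch.nn.functional as F

def softmax_attention(x):
    # Standard causal softmax attention, quadratic in sequence length.
    n, d = x.shape
    scores = x @ x.transpose(-2, -1) / d ** 0.5
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    return F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ x

def linear_attention(x):
    # Causal linear attention via prefix sums of k⊗v, linear in sequence length.
    phi = F.elu(x) + 1                                               # positive feature map
    kv = torch.cumsum(phi.unsqueeze(-1) * x.unsqueeze(-2), dim=0)    # running sum of phi(k) v^T
    z = torch.cumsum(phi, dim=0)                                     # running normaliser
    return (phi.unsqueeze(-2) @ kv).squeeze(-2) / (phi * z).sum(-1, keepdim=True)

x = torch.randn(16, 64)                  # (sequence, hidden)
for layer_idx in range(32):
    if layer_idx % 8 == 0:
        x = x + softmax_attention(x)     # 1 full-attention layer ...
    else:
        x = x + linear_attention(x)      # ... per 7 linear-attention layers
print(x.shape)                           # torch.Size([16, 64])
```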

Released: 2026-04