Local LLM model page

Ling-2.6-flash (104B MoE)

InclusionAI's MIT-licensed instruct MoE optimized for fast agent workloads. 104B total parameters, only 7.4B active, hybrid linear attention, 262K context and strong tool-use / multi-step execution with high token efficiency.

Find the best model for my hardware Browse all 183 LLMs

Parameters

104B (7.4B active)

Minimum RAM

80 GB

Model size

65 GB

Quantization

Q4_K_M

Can Ling-2.6-flash (104B MoE) run locally?

Ling-2.6-flash (104B MoE) is best suited for large-memory workstations. LocalClaw recommends Q4_K_M as the default quantization, with at least 80 GB RAM.

Search term for LM Studio or compatible runtimes: Ling-2.6-flash

Hugging Face repository: inclusionAI/Ling-2.6-flash

chatcodereasoningspeedquality

Strengths

⚡ Very efficient MoE: 104B total but only 7.4B active parameters
262K context window for large prompts, repositories and agent traces
Hybrid linear attention improves prefill/decode throughput on long workloads
Strong tool-use and agent execution focus: BFCL-V4, TAU2-bench, SWE-bench Verified, Claw-Eval and PinchBench are highlighted by InclusionAI
Designed to reduce unnecessary reasoning tokens for high-frequency production agent tasks
MIT license — permissive open-source usage

Limitations

Large 104B MoE files — server-grade hardware is still recommended despite only 7.4B active parameters
Best support currently targets SGLang/vLLM with trust_remote_code rather than simple consumer one-click runtimes
Official notes mention possible tool hallucinations in highly complex scenarios
Natural Chinese/English switching and highly complex instruction following still have room to improve

Best use cases

Agentic coding and codebase automation
Tool-calling assistants and multi-step workflow execution
Long-context analysis up to 262K tokens
Production agents where token efficiency and latency matter
Open-source research comparing efficient MoE architectures
OpenClaw / Claude Code / Kilo Code style agent workflows

Benchmarks

Speed: 8/10

Quality: 9/10

Coding: 9/10

Reasoning: 8/10

Technical details

Developer: InclusionAI / Ant Group

License: MIT

Context window: 262,144 tokens

Architecture: Sparse MoE instruct model — 104B total parameters, 7.4B active. Hybrid linear attention using a 1:7 MLA + Lightning Linear architecture, optimized for high-throughput agent inference.

Released: 2026-04