Ling-2.6-flash (104B MoE)
InclusionAI's MIT-licensed instruct MoE optimized for fast agent workloads: 104B total parameters with only 7.4B active, hybrid linear attention, a 262K context window, and strong tool-use / multi-step execution with high token efficiency.
Can Ling-2.6-flash (104B MoE) run locally?
Ling-2.6-flash (104B MoE) is best suited for large-memory workstations. LocalClaw recommends Q4_K_M as the default quantization, with at least 80 GB RAM.
Search term for LM Studio or compatible runtimes: Ling-2.6-flash
Hugging Face repository: inclusionAI/Ling-2.6-flash
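For rough sizing: Q4_K_M stores roughly 4.8 bits per weight, so 104B parameters come to about 60-65 GB of weights before any KV cache, which is why 80 GB of RAM is a sensible floor. The sketch below shows one way to load such a build with llama-cpp-python, assuming a community GGUF conversion is available; the local file name and the context/offload settings are illustrative assumptions, not values published by InclusionAI.

```python
# Minimal local-inference sketch with llama-cpp-python, assuming a Q4_K_M GGUF
# conversion of Ling-2.6-flash exists. The file name below is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="./ling-2.6-flash-Q4_K_M.gguf",  # hypothetical local file
    n_ctx=32768,       # start well below the 262K maximum to keep RAM in check
    n_gpu_layers=-1,   # offload all layers if a GPU is available, else set to 0
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this repository's build steps."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```

Keeping n_ctx modest at first avoids allocating a 262K-token KV cache that most workstations cannot hold.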
Strengths
- ⚡ Very efficient MoE: 104B total but only 7.4B active parameters
- 262K context window for large prompts, repositories and agent traces
- Hybrid linear attention improves prefill/decode throughput on long workloads
- Strong tool-use and agent execution focus: BFCL-V4, TAU2-bench, SWE-bench Verified, Claw-Eval and PinchBench are highlighted by InclusionAI
- Designed to reduce unnecessary reasoning tokens for high-frequency production agent tasks
- MIT license — permissive open-source usage
Limitations
- Large 104B MoE files — server-grade hardware is still recommended despite only 7.4B active parameters
- Best support currently targets SGLang/vLLM with trust_remote_code rather than simple consumer one-click runtimes (see the vLLM sketch after this list)
- Official notes mention possible tool hallucinations in highly complex scenarios
- Natural Chinese/English switching and highly complex instruction following still have room for improvement
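Since the best-supported path is currently SGLang/vLLM with trust_remote_code, a minimal vLLM offline-inference sketch follows. It assumes the repository's custom modeling code loads cleanly under trust_remote_code; the context cap and tensor-parallel size are illustrative and should be matched to your hardware.

```python
# Minimal vLLM sketch, assuming Ling-2.6-flash loads via trust_remote_code
# as noted in the limitations above. Settings below are illustrative defaults.
from vllm import LLM, SamplingParams

llm = LLM(
    model="inclusionAI/Ling-2.6-flash",
    trust_remote_code=True,   # the repo's custom modeling code requires this
    max_model_len=65536,      # cap below the 262K maximum unless you have the memory
    tensor_parallel_size=2,   # illustrative; match to your GPU count
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(
    ["Write a shell one-liner that counts TODO comments in a repo."], params
)
print(outputs[0].outputs[0].text)
```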
Best use cases
- Agentic coding and codebase automation
- Tool-calling assistants and multi-step workflow execution (see the tool-calling sketch after this list)
- Long-context analysis up to 262K tokens
- Production agents where token efficiency and latency matter
- Open-source research comparing efficient MoE architectures
- OpenClaw / Claude Code / Kilo Code style agent workflows
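To make the tool-calling use case concrete, here is a hedged sketch that sends a function-calling request to a locally hosted, OpenAI-compatible endpoint, for example one exposed by vLLM's server. The base URL, the get_weather tool schema, and the placeholder API key are assumptions for illustration only.

```python
# Tool-calling sketch against a local OpenAI-compatible endpoint hosting the model
# (e.g. started with vLLM's server and --trust-remote-code). All values below that
# are not from the model card (URL, API key, tool schema) are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="inclusionAI/Ling-2.6-flash",
    messages=[{"role": "user", "content": "What's the weather in Hangzhou right now?"}],
    tools=tools,
)
# If the model decides to call the tool, the structured call arrives here.
print(resp.choices[0].message.tool_calls)
```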
Benchmarks
Speed: 8/10
Quality: 9/10
Coding: 9/10
Reasoning: 8/10
Technical details
Developer: InclusionAI / Ant Group
License: MIT
Context window: 262,144 tokens
Architecture: Sparse MoE instruct model with 104B total parameters and 7.4B active. Hybrid linear attention using a 1:7 MLA + Lightning Linear layout, optimized for high-throughput agent inference (a rough per-token efficiency estimate follows below).
Released: 2026-04
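As a back-of-the-envelope check on the efficiency claim, only the routed 7.4B of the 104B parameters participate in each forward pass, so per-token compute tracks a roughly 7B dense model. The figures below come straight from the spec above; nothing else is assumed.

```python
# Rough per-token efficiency estimate from the numbers in the technical details.
TOTAL_PARAMS = 104e9   # total MoE parameters
ACTIVE_PARAMS = 7.4e9  # parameters routed per token

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active parameters per token: {active_fraction:.1%}")  # ~7.1%
# Per-token FLOPs therefore stay close to a ~7B dense model, while the router
# still draws on the full 104B parameter pool for quality.
```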