Qwen3 TTS
State-of-the-art multilingual TTS with natural prosody and emotion control. Supports 30+ languages with streaming inference.
Static, indexable pages for open local text-to-speech, voice cloning, ASR and speech AI models.
Fast, local neural TTS optimized for Raspberry Pi and edge devices. Lightweight with good quality for embedded systems.
Flow-matching based TTS with SOTA quality and extremely fast inference. Simple and efficient architecture.
Ultra-lightweight with strong quality for its size. Only 82M parameters; runs on CPU in real time. Among the best quality-to-size ratios of current TTS models.
LLM-based TTS with human-level naturalness. Supports rich emotion tags (laugh, sigh, hesitation). Built on the Llama 3 architecture for highly expressive speech.
Open-source SOTA voice cloning from Resemble AI. Outperforms ElevenLabs on naturalness benchmarks. Supports emotion exaggeration control and ultra-stable generation.
1.6B-parameter dialogue TTS that generates realistic two-speaker conversations from a single transcript. Natively supports non-verbal cues such as [laughs], [coughs], and [sighs].
Pure language model approach to TTS - no separate audio encoder. Runs via llama.cpp for fully local GGUF inference. Excellent for CPU-only setups.
Industrial-grade multilingual TTS with streaming, voice cloning and emotion control. Exceptional Chinese + English quality. Used in production at Alibaba scale.
Bilingual TTS with virtual speaker creation: control pitch, speed, and gender from text. Built on a Qwen2.5 LLM backbone.
Desktop app & orchestrator for local TTS - not a model. Provides a UI studio, voice profile management, and a local API. Generates audio via swappable backends (Qwen3 TTS, Kokoro, Piper, XTTS…). Think of it as a front-end shell that runs on top of your installed TTS models.
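The "front-end shell over swappable backends" idea in the entry above can be illustrated with a small sketch. This is not the app's actual API: the `Orchestrator` class, the `Backend` signature, the `silent_backend` stand-in, and the 16 kHz sample rate are all hypothetical, chosen only to show how one local interface can route text to whichever TTS backend is registered.

```python
import io
import wave
from typing import Callable, Dict

# Hypothetical backend signature: text in, raw 16-bit mono PCM bytes out.
Backend = Callable[[str], bytes]

SAMPLE_RATE = 16_000  # assumed sample rate for this sketch


def silent_backend(text: str) -> bytes:
    """Stand-in for a real model: 10 ms of silence per input character."""
    samples = len(text) * SAMPLE_RATE // 100
    return b"\x00\x00" * samples


class Orchestrator:
    """Routes synthesis requests to a named, swappable backend."""

    def __init__(self) -> None:
        self._backends: Dict[str, Backend] = {}

    def register(self, name: str, backend: Backend) -> None:
        self._backends[name] = backend

    def synthesize(self, text: str, backend: str) -> bytes:
        """Run the chosen backend and wrap its PCM output in a WAV container."""
        if backend not in self._backends:
            raise KeyError(f"no backend registered under {backend!r}")
        pcm = self._backends[backend](text)
        buf = io.BytesIO()
        with wave.open(buf, "wb") as wav:
            wav.setnchannels(1)      # mono
            wav.setsampwidth(2)      # 16-bit samples
            wav.setframerate(SAMPLE_RATE)
            wav.writeframes(pcm)
        return buf.getvalue()


orchestrator = Orchestrator()
orchestrator.register("silent", silent_backend)
wav_bytes = orchestrator.synthesize("hello", backend="silent")
```

Swapping models then means registering a different callable under a new name; the UI and local API layers only ever call `synthesize`.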
Conversational Speech Model - generates speech with natural turn-taking, backchannels and interruptions. Built specifically for multi-turn dialogue with real-time response generation.