Kokoro TTS
Ultra-lightweight yet stunning quality. 82M params only - runs on CPU in real-time. Best quality-to-size ratio of any TTS model.
One of the most popular open TTS models, with voice cloning from just 6 seconds of reference audio. Discontinued but still widely used.
Coqui TTS (XTTS v2) is a local speech model from Coqui. It is best suited for voice cloning and multilingual workflows. Check the license before commercial use.
gpu, apple
pytorch, onnx
Unlimited via cloning
medium
CPML (custom)
2023-09
pip install TTS
cloning, multilingual
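As a rough illustration of the cloning workflow described above, here is a minimal sketch using the `TTS.api` interface from the `TTS` package. The helper names (`wav_duration_seconds`, `clone_voice`) and the 6-second minimum check are assumptions made for this example, based only on the "6 seconds of audio" figure quoted above; they are not part of the library. The model is downloaded on first use, which may prompt you to accept the license terms.

```python
import wave
from pathlib import Path

# Assumption: taken from the "6 seconds of audio" figure above, not a library constant.
MIN_REFERENCE_SECONDS = 6.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def clone_voice(text, speaker_wav, out_path, language="en"):
    """Synthesize `text` in the voice of `speaker_wav` (hypothetical helper)."""
    duration = wav_duration_seconds(speaker_wav)
    if duration < MIN_REFERENCE_SECONDS:
        raise ValueError(
            f"reference clip is {duration:.1f}s; expected at least "
            f"{MIN_REFERENCE_SECONDS:.0f}s"
        )

    # Imported lazily so the duration check works without the heavy dependencies.
    import torch
    from TTS.api import TTS

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
    tts.tts_to_file(
        text=text,
        speaker_wav=str(speaker_wav),
        language=language,
        file_path=str(out_path),
    )
    return Path(out_path)
```

A call such as `clone_voice("Hello there.", "reference.wav", "output.wav")` would write the synthesized speech to `output.wav`; both file names are placeholders.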
Production-grade streaming ASR from Kyutai (makers of Moshi). Delay-streaming transformer with 500ms latency, word-level timestamps, speaker diarization. Top of Open ASR Leaderboard for real-time French + English.
Iterative upgrade over the original F5-TTS. Faster convergence via improved flow-matching schedule, better Chinese prosody, cross-lingual cloning. Now with streaming inference and improved CFM sampler.
OpenAI's optimized Whisper v3 with 4 decoder layers instead of 32. 8× faster than Whisper Large v3 with only minor accuracy trade-off. 99 languages supported. New gold standard for fast local transcription.
First super-realistic TTS LLM that runs in real-time on CPU. 748M params, LLaMA 3.2 backbone + NeuCodec audio tokenizer. GGUF-native - perfect for on-device agents and offline apps. Instant 3s voice cloning.
Flow-matching based TTS with SOTA quality and extremely fast inference. Simple and efficient architecture.