Local TTS model

XTTS v3 (Community)

Community-maintained successor to XTTS v2, developed after Coqui shut down. It offers improved cloning stability, fewer artefacts, better prosody, and support for 20+ languages, and is intended as a drop-in replacement for existing XTTS pipelines.

Quality: 9.1/10
Speed: 7/10
Size: 2GB
Languages: 20+

Quick answer

XTTS v3 (Community) is a local speech model from Coqui Community. It is best suited for voice cloning, multilingual, and emotion-aware workflows. Check the license (MPL 2.0) before commercial use.

Model details

Hardware

gpu, apple

Formats

pytorch, onnx

Voices

Unlimited via cloning (6s sample)
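The 6s figure above suggests a quick pre-flight check on a reference clip before attempting to clone from it. The sketch below uses only Python's standard `wave` module and assumes the clip is an uncompressed PCM WAV file. Treating 6 seconds as a hard minimum is an assumption made here for illustration, as are the helper names `wav_duration_seconds` and `is_long_enough`; none of this is part of the model's own API.

```python
import wave

# Assumption: the "6s sample" note is read as a minimum reference length.
MIN_REFERENCE_SECONDS = 6.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path: str, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """Return True if the clip at `path` lasts at least `minimum` seconds."""
    return wav_duration_seconds(path) >= minimum
```

A caller might run `is_long_enough("speaker.wav")` (a hypothetical file name) and ask for a longer recording when it returns `False`.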

Latency

low

License

MPL 2.0

Release

2025-06

Install command

pip install TTS-community
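After running the install command, it can be useful to confirm that the package is actually present in the active environment. The sketch below relies on the standard-library `importlib.metadata` module and takes the distribution name `TTS-community` from the command above. The helper name `installed_version` is hypothetical, and nothing here is specific to the model itself.

```python
from importlib import metadata
from typing import Optional

# Distribution name as given in the install command above.
PACKAGE_NAME = "TTS-community"


def installed_version(dist_name: str = PACKAGE_NAME) -> Optional[str]:
    """Return the installed version of `dist_name`, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print(f"{PACKAGE_NAME} is not installed in this environment")
    else:
        print(f"{PACKAGE_NAME} {version} is installed")
```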

Features

cloning, multilingual, emotion

Languages: en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh, ja, hu, ko, hi, ro, sv, uk
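A pipeline that accepts user-supplied language codes may want to reject unsupported ones before loading the model. The following sketch hard-codes the 20 codes listed above. The region-stripping rule (for example `pt-BR` to `pt`) and the names `normalize_language` and `require_supported` are assumptions for illustration; the model may expect different identifiers for some languages, so check its own documentation.

```python
# The 20 language codes listed on this page.
SUPPORTED_LANGUAGES = (
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru", "nl",
    "cs", "ar", "zh", "ja", "hu", "ko", "hi", "ro", "sv", "uk",
)


def normalize_language(code: str) -> str:
    """Lower-case a code and drop any region suffix, e.g. 'pt-BR' -> 'pt'."""
    return code.strip().lower().replace("_", "-").split("-")[0]


def require_supported(code: str) -> str:
    """Return the normalized code, or raise ValueError if it is not listed."""
    base = normalize_language(code)
    if base not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {code!r}")
    return base
```

Raising early keeps an unsupported request from reaching the comparatively expensive model-loading step.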

Context: Community fork - active maintenance

Related speech models

hexgrad

Kokoro TTS

Quality 9.2 · Speed 9.8 · 0.33GB · Apache 2.0

Ultra-lightweight yet stunning quality. 82M params only - runs on CPU in real-time. Best quality-to-size ratio of any TTS model.

realtime, streaming, low-latency, multilingual

Kyutai

Kyutai STT 2.6B

Quality 9.4 · Speed 9.5 · 2.7GB · CC-BY-4.0

Production-grade streaming ASR from Kyutai (makers of Moshi). Delay-streaming transformer with 500ms latency, word-level timestamps, speaker diarization. Top of Open ASR Leaderboard for real-time French + English.

streaming, realtime, low-latency, multilingual

Speech Research (SWivid)

F5-TTS v1.1

Quality 9.5 · Speed 9.2 · 1.6GB · MIT

Iterative upgrade over the original F5-TTS. Faster convergence via improved flow-matching schedule, better Chinese prosody, cross-lingual cloning. Now with streaming inference and improved CFM sampler.

realtime, cloning, streaming, multilingual

OpenAI

Whisper v3 Turbo

Quality 9.1 · Speed 9.5 · 1.6GB · MIT

OpenAI's optimized Whisper v3 with 4 decoder layers instead of 32. 8× faster than Whisper Large v3 with only a minor accuracy trade-off. 99 languages supported. New gold standard for fast local transcription.

streaming, realtime, multilingual, low-latency

Kyutai

Moshi

Quality 9 · Speed 9.5 · 7.5GB · CC-BY-4.0

Full-duplex spoken dialogue model - listens and speaks simultaneously with ~160 ms latency. Not just a TTS model but a real-time conversational speech model. Runs on a single L4 GPU or Mac M3 Pro.

dialogue, streaming, realtime, low-latency, emotion

Neuphonic

NeuTTS Air

Quality 9 · Speed 9.5 · 0.75GB · Apache 2.0

First super-realistic TTS LLM that runs in real-time on CPU. 748M params, LLaMA 3.2 backbone + NeuCodec audio tokenizer. GGUF-native - perfect for on-device agents and offline apps. Instant 3s voice cloning.

cloning, realtime, streaming, low-latency