Local TTS model

Coqui TTS (XTTS v2)

A popular open TTS model that clones a voice from roughly 6 seconds of reference audio. Development is discontinued, but the model is still widely used.

Quality 9.2/10 · Speed 6/10 · Size 1.8GB · Languages 17+

Quick answer

Coqui TTS (XTTS v2) is a local speech model from Coqui. It is best suited for voice cloning and multilingual workflows. Check the CPML license terms before commercial use.

Model details

Hardware

gpu, apple

Formats

pytorch, onnx

Voices

Unlimited via cloning

Latency

medium

License

CPML (custom)

Release

2023-09

Install command

pip install TTS
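
As a minimal sketch of what comes after installation, the following assumes the `TTS.api.TTS` Python interface shipped by that package and the model identifier `tts_models/multilingual/multi-dataset/xtts_v2`. The helper names `build_request` and `synthesize` and the file names are hypothetical, chosen for illustration. The library import is deferred so the request can be assembled and checked without loading the model; the first real call is expected to download the weights.

```python
# Assumed model identifier for XTTS v2 in the Coqui TTS model registry.
MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"


def build_request(text, speaker_wav, language="en", out_path="output.wav"):
    """Collect the keyword arguments later passed to TTS.tts_to_file (hypothetical helper)."""
    if not text.strip():
        raise ValueError("text must not be empty")
    return {
        "text": text,
        "speaker_wav": speaker_wav,  # short reference clip of the voice to clone
        "language": language,
        "file_path": out_path,
    }


def synthesize(request, use_gpu=False):
    """Load XTTS v2 and write the cloned speech to request['file_path']."""
    # Deferred import: the rest of this module works without the package installed.
    from TTS.api import TTS

    tts = TTS(MODEL_NAME)
    if use_gpu:
        tts = tts.to("cuda")
    tts.tts_to_file(**request)
    return request["file_path"]
```

A call might look like `synthesize(build_request("Hello there.", "reference.wav"))`, where `reference.wav` stands in for your own recording of the target voice.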

Features

cloning, multilingual

Languages: en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh, ja, hu, ko

Context: cloning from a 6-second reference clip, emotion
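
The language list and the 6-second cloning figure above suggest two checks worth running before synthesis. The sketch below uses only the Python standard library; the function names are hypothetical, the language codes are copied from this card (the library itself may spell some differently, for example `zh-cn` for Chinese), and the 6-second threshold is treated as a guideline rather than a hard limit of the model.

```python
import wave

# Language codes as listed on this card.
SUPPORTED_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr",
    "ru", "nl", "cs", "ar", "zh", "ja", "hu", "ko",
}

# Guideline from the card: cloning from about 6 seconds of audio.
MIN_REFERENCE_SECONDS = 6.0


def is_supported(language):
    """Return True if the language code appears in the card's list."""
    return language.lower() in SUPPORTED_LANGUAGES


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def preflight(language, reference_path):
    """Return a list of problems found; an empty list means the inputs look usable."""
    problems = []
    if not is_supported(language):
        problems.append(f"unsupported language code: {language!r}")
    if wav_duration_seconds(reference_path) < MIN_REFERENCE_SECONDS:
        problems.append("reference clip is shorter than 6 seconds")
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report every issue with a request at once.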

Related TTS models

hexgrad

Kokoro TTS

Quality 9.2 · Speed 9.8 · 0.33GB · Apache 2.0

Ultra-lightweight with high quality for its size. Only 82M parameters, and it runs on CPU in real time. Best quality-to-size ratio of any TTS model.

realtime, streaming, low-latency, multilingual

Kyutai

Kyutai STT 2.6B

Quality 9.4 · Speed 9.5 · 2.7GB · CC-BY-4.0

Production-grade streaming ASR from Kyutai (makers of Moshi). Delay-streaming transformer with 500ms latency, word-level timestamps, speaker diarization. Top of Open ASR Leaderboard for real-time French + English.

streaming, realtime, low-latency, multilingual

Speech Research (SWivid)

F5-TTS v1.1

Quality 9.5 · Speed 9.2 · 1.6GB · MIT

Iterative upgrade over the original F5-TTS. Faster convergence via improved flow-matching schedule, better Chinese prosody, cross-lingual cloning. Now with streaming inference and improved CFM sampler.

realtime, cloning, streaming, multilingual

OpenAI

Whisper v3 Turbo

Quality 9.1 · Speed 9.5 · 1.6GB · MIT

OpenAI's optimized Whisper v3 with 4 decoder layers instead of 32. 8× faster than Whisper Large v3 with only minor accuracy trade-off. 99 languages supported. New gold standard for fast local transcription.

streaming, realtime, multilingual, low-latency

Neuphonic

NeuTTS Air

Quality 9 · Speed 9.5 · 0.75GB · Apache 2.0

First super-realistic TTS LLM that runs in real time on CPU. 748M params, LLaMA 3.2 backbone + NeuCodec audio tokenizer. GGUF-native, which suits on-device agents and offline apps. Instant voice cloning from 3 seconds of audio.

cloning, realtime, streaming, low-latency

Speech Research

F5-TTS

Quality 9.4 · Speed 9 · 1.5GB · MIT

Flow-matching based TTS with SOTA quality and extremely fast inference. Simple and efficient architecture.

realtime, cloning, streaming