Local TTS model

F5-TTS v1.1

An iterative upgrade over the original F5-TTS. It converges faster thanks to an improved flow-matching schedule and adds better Chinese prosody and cross-lingual cloning. This release also brings streaming inference and an improved CFM (conditional flow matching) sampler.

Quality: 9.5/10
Speed: 9.2/10
Size: 1.6GB
Languages: 5+

Quick answer

F5-TTS v1.1 is a local speech model from Speech Research (SWivid). It is best suited to real-time, voice-cloning, streaming, and multilingual workflows. Check the license before commercial use.

Model details

Hardware: gpu, apple
Formats: pytorch, safetensors
Voices: Reference-based cloning
Latency: ultra-low
License: MIT
Release: 2025-08

Install command

pip install f5-tts
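
Once the package is installed, inference is typically driven from the command line. The sketch below assembles a call to the `f5-tts_infer-cli` entry point from the upstream F5-TTS project. The checkpoint name `F5TTS_v1_Base`, the file `reference.wav`, and both text strings are assumptions for illustration; this page does not say which checkpoint name corresponds to v1.1, so check the project's README before relying on it.

```python
import shlex
import shutil
import subprocess
from pathlib import Path


def build_infer_command(ref_audio, ref_text, gen_text, model="F5TTS_v1_Base"):
    """Build the argument list for the f5-tts_infer-cli entry point.

    The default model name is an assumption, not taken from this page.
    """
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,   # short clip of the voice to clone
        "--ref_text", ref_text,     # transcript of that clip
        "--gen_text", gen_text,     # text to synthesize
    ]


def run_if_available(cmd, ref_audio):
    """Run the command only if the CLI is on PATH and the clip exists."""
    if shutil.which(cmd[0]) is None or not Path(ref_audio).is_file():
        return False
    subprocess.run(cmd, check=True)
    return True


cmd = build_infer_command(
    ref_audio="reference.wav",  # hypothetical path
    ref_text="Transcript of the reference clip.",
    gen_text="Text to speak in the cloned voice.",
)
print(shlex.join(cmd))  # shell-safe form of the command, for copy and paste
ran = run_if_available(cmd, "reference.wav")
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems when the reference or generated text contains quotes or spaces.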

Features

realtime, cloning, streaming, multilingual

Languages: en, zh, ja, fr, de

Context: Improved flow-matching + streaming

Related speech models

NVIDIA

Parakeet TDT 0.6B v2

Quality 9.4 · Speed 10 · 1.1GB · CC-BY-4.0

NVIDIA's SOTA lightweight ASR - 0.6B params, #1 on Open ASR Leaderboard for English. TDT (Token-and-Duration Transducer) decoding makes it 50× faster than Whisper Large v3 on GPU. Real-time streaming with word-level timestamps.

streaming, realtime, low-latency

hexgrad

Kokoro TTS

Quality 9.2 · Speed 9.8 · 0.33GB · Apache 2.0

Ultra-lightweight yet stunning quality. 82M params only - runs on CPU in real-time. Best quality-to-size ratio of any TTS model.

realtime, streaming, low-latency, multilingual

Kyutai

Kyutai STT 2.6B

Quality 9.4 · Speed 9.5 · 2.7GB · CC-BY-4.0

Production-grade streaming ASR from Kyutai (makers of Moshi). Delay-streaming transformer with 500ms latency, word-level timestamps, speaker diarization. Top of Open ASR Leaderboard for real-time French + English.

streaming, realtime, low-latency, multilingual

OpenAI

Whisper v3 Turbo

Quality 9.1 · Speed 9.5 · 1.6GB · MIT

OpenAI's optimized Whisper v3 with 4 decoder layers instead of 32. 8× faster than Whisper Large v3 with only minor accuracy trade-off. 99 languages supported. New gold standard for fast local transcription.

streaming, realtime, multilingual, low-latency

jamiepine / Community

Voicebox

Quality 9 · Speed 9.5 · 0.05GB · MIT

Desktop app & orchestrator for local TTS - not a model. Provides a UI studio, voice profile management, and a local API. Generates audio via swappable backends (Qwen3 TTS, Kokoro, Piper, XTTS…). Think of it as a front-end shell that runs on top of your installed TTS models.

streaming, realtime, low-latency

Kyutai

Moshi

Quality 9 · Speed 9.5 · 7.5GB · CC-BY-4.0

Full-duplex spoken dialogue model - listens and speaks simultaneously with ~160 ms latency. Not just a TTS but a real-time conversational speech model. Runs on a single L4 GPU or Mac M3 Pro.

dialogue, streaming, realtime, low-latency, emotion
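
The Kokoro TTS card claims the best quality-to-size ratio of any TTS model. As a rough check against the figures on this page only, the sketch below divides each card's quality score by its size in GB. The `Card` class, the `kind` labels (which entries count as TTS, ASR, an app, or a dialogue model), and the ratio itself are assumptions made for illustration, not a metric this page defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    name: str
    kind: str        # "tts", "asr", "app", or "dialogue" (labels assumed here)
    quality: float   # score out of 10, as listed on the card
    size_gb: float   # size as listed on the card

    @property
    def quality_per_gb(self) -> float:
        return self.quality / self.size_gb


# Scores and sizes copied from the cards on this page.
CARDS = [
    Card("F5-TTS v1.1", "tts", 9.5, 1.6),
    Card("Parakeet TDT 0.6B v2", "asr", 9.4, 1.1),
    Card("Kokoro TTS", "tts", 9.2, 0.33),
    Card("Kyutai STT 2.6B", "asr", 9.4, 2.7),
    Card("Whisper v3 Turbo", "asr", 9.1, 1.6),
    Card("Voicebox", "app", 9.0, 0.05),     # a front end, not a model
    Card("Moshi", "dialogue", 9.0, 7.5),
]


def rank_by_quality_per_gb(cards, kind="tts"):
    """Return cards of one kind, highest quality-per-GB first."""
    matching = (c for c in cards if c.kind == kind)
    return sorted(matching, key=lambda c: c.quality_per_gb, reverse=True)


ranked = rank_by_quality_per_gb(CARDS)
for card in ranked:
    print(f"{card.name}: {card.quality_per_gb:.1f} quality points per GB")
```

On these figures Kokoro TTS ranks ahead of F5-TTS v1.1 among the two entries treated as TTS models. The comparison covers only the cards listed here, so it does not by itself confirm the "any TTS model" wording.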