Local TTS model

Fish Speech

Fast, high-quality TTS with voice cloning and multilingual support. Optimized for real-time applications.

Quality
9/10
Speed
8.5/10
Size
3GB
Languages
10+

Quick answer

Fish Speech is a local speech model from Fish Audio. It is best suited for streaming, realtime, cloning, multilingual workflows. Check the license before commercial use.

Model details

Hardware

gpuapple

Formats

pytorchonnx

Voices

Unlimited cloning

Latency

ultra-low

License

Apache 2.0

Release

2024-06

Install command

pip install fish-speech

Features

streamingrealtimecloningmultilingual

Languages: en, zh, ja, ko, fr, de, es, it, pt, ru

Context: Instant cloning

Related TTS models

NVIDIA

Parakeet TDT 0.6B v2

Quality 9.4 · Speed 10 · 1.1GB · CC-BY-4.0

NVIDIA's SOTA lightweight ASR - 0.6B params, #1 on Open ASR Leaderboard for English. TDT (Token-and-Duration Transducer) decoding makes it 50× faster than Whisper Large v3 on GPU. Real-time streaming with word-level timestamps.

streamingrealtimelow-latency
hexgrad

Kokoro TTS

Quality 9.2 · Speed 9.8 · 0.33GB · Apache 2.0

Ultra-lightweight yet stunning quality. 82M params only - runs on CPU in real-time. Best quality-to-size ratio of any TTS model.

realtimestreaminglow-latencymultilingual
Kyutai

Kyutai STT 2.6B

Quality 9.4 · Speed 9.5 · 2.7GB · CC-BY-4.0

Production-grade streaming ASR from Kyutai (makers of Moshi). Delay-streaming transformer with 500ms latency, word-level timestamps, speaker diarization. Top of Open ASR Leaderboard for real-time French + English.

streamingrealtimelow-latencymultilingual
Speech Research (SWivid)

F5-TTS v1.1

Quality 9.5 · Speed 9.2 · 1.6GB · MIT

Iterative upgrade over the original F5-TTS. Faster convergence via improved flow-matching schedule, better Chinese prosody, cross-lingual cloning. Now with streaming inference and improved CFM sampler.

realtimecloningstreamingmultilingual
OpenAI

Whisper v3 Turbo

Quality 9.1 · Speed 9.5 · 1.6GB · MIT

OpenAI's optimized Whisper v3 with 4 decoder layers instead of 32. 8× faster than Whisper Large v3 with only minor accuracy trade-off. 99 languages supported. New gold standard for fast local transcription.

streamingrealtimemultilinguallow-latency
jamiepine / Community

Voicebox

Quality 9 · Speed 9.5 · 0.05GB · MIT

Desktop app & orchestrator for local TTS - not a model. Provides a UI studio, voice profile management, and a local API. Generates audio via swappable backends (Qwen3 TTS, Kokoro, Piper, XTTS…). Think of it as a front-end shell that runs on top of your installed TTS models.

streamingrealtimelow-latency