Local TTS model

Whisper v3 Turbo

OpenAI's optimized Whisper v3 with 4 decoder layers instead of 32. 8× faster than Whisper Large v3 with only minor accuracy trade-off. 99 languages supported. New gold standard for fast local transcription.

Model page Browse all TTS models

Quality

9.1/10

Speed

9.5/10

Size

1.6GB

Languages

99+

Quick answer

Whisper v3 Turbo is a local speech model from OpenAI. It is best suited for streaming, realtime, multilingual, low-latency workflows. Check the license before commercial use.

Model details

Hardware

cpugpuapple

Formats

pytorchsafetensorsgguf

Voices

N/A (ASR: outputs text)

Latency

ultra-low

License

MIT

Release

2024-10

Install command

pip install openai-whisper

Features

streamingrealtimemultilinguallow-latency

Languages: multilingual

Context: 809M params, 4 decoder layers, 99 languages

Related TTS models

NVIDIA

Parakeet TDT 0.6B v2

Quality 9.4 · Speed 10 · 1.1GB · CC-BY-4.0

NVIDIA's SOTA lightweight ASR - 0.6B params, #1 on Open ASR Leaderboard for English. TDT (Token-and-Duration Transducer) decoding makes it 50× faster than Whisper Large v3 on GPU. Real-time streaming with word-level timestamps.

streamingrealtimelow-latency

hexgrad

Kokoro TTS

Quality 9.2 · Speed 9.8 · 0.33GB · Apache 2.0

Ultra-lightweight yet stunning quality. 82M params only - runs on CPU in real-time. Best quality-to-size ratio of any TTS model.

realtimestreaminglow-latencymultilingual

Kyutai

Kyutai STT 2.6B

Quality 9.4 · Speed 9.5 · 2.7GB · CC-BY-4.0

Production-grade streaming ASR from Kyutai (makers of Moshi). Delay-streaming transformer with 500ms latency, word-level timestamps, speaker diarization. Top of Open ASR Leaderboard for real-time French + English.

streamingrealtimelow-latencymultilingual

Speech Research (SWivid)

F5-TTS v1.1

Quality 9.5 · Speed 9.2 · 1.6GB · MIT

Iterative upgrade over the original F5-TTS. Faster convergence via improved flow-matching schedule, better Chinese prosody, cross-lingual cloning. Now with streaming inference and improved CFM sampler.

realtimecloningstreamingmultilingual

jamiepine / Community

Voicebox

Quality 9 · Speed 9.5 · 0.05GB · MIT

Desktop app & orchestrator for local TTS - not a model. Provides a UI studio, voice profile management, and a local API. Generates audio via swappable backends (Qwen3 TTS, Kokoro, Piper, XTTS…). Think of it as a front-end shell that runs on top of your installed TTS models.

streamingrealtimelow-latency

Kyutai

Moshi

Quality 9 · Speed 9.5 · 7.5GB · CC-BY-4.0

Full-duplex spoken dialogue model - listens and speaks simultaneously with ~160 ms latency. Not just a TTS but a real-time conversational speech model. Runs on a single L4 GPU or Mac M3 Pro.

dialoguestreamingrealtimelow-latencyemotion