Local TTS model

GPT-SoVITS

Zero-shot voice cloning TTS combining GPT and SoVITS. Clone any voice from 5 seconds of audio. Extremely popular in the open-source community with 40K+ GitHub stars.

Quality: 9.1/10
Speed: 7/10
Size: 2GB
Languages: 5+

Quick answer

GPT-SoVITS is a local speech model from RVC-Boss. It is best suited to voice cloning, multilingual synthesis, and emotion-aware workflows. Check the license before commercial use.

Model details

Hardware

gpu, apple

Formats

pytorch

Voices

Zero-shot cloning from 5s

Latency

medium

License

MIT

Release

2024-01
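
The Voices entry above quotes 5 seconds of reference audio for zero-shot cloning. A minimal sketch of a pre-flight check on a reference clip, using only Python's standard `wave` module, might look like the following. The 5-second threshold is taken from this page and is treated as a minimum here by assumption; the helper names are hypothetical and not part of GPT-SoVITS, and the check only handles uncompressed WAV files.

```python
import wave

# Figure quoted on this page; using it as a hard minimum is an assumption.
MIN_REFERENCE_SECONDS = 5.0


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path, minimum=MIN_REFERENCE_SECONDS):
    """Return True if the clip at `path` lasts at least `minimum` seconds."""
    return wav_duration_seconds(path) >= minimum
```

Calling `is_long_enough("reference.wav")` before submitting a clip gives an early, cheap failure instead of a poor clone later; the file name is only an illustration.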

Install command

git clone https://github.com/RVC-Boss/GPT-SoVITS

Features

cloning, multilingual, emotion

Languages: en, zh, ja, ko, yue

Context: GPT + SoVITS hybrid
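
To make the cloning workflow concrete, here is a hedged sketch of how a client might assemble one zero-shot request for a locally running GPT-SoVITS server. The language codes come from the list above. The endpoint URL, port, and JSON field names are assumptions modeled on the repository's `api_v2.py` script and should be verified against the checked-out version; `build_tts_payload` and `synthesize` are hypothetical helper names.

```python
import json
import urllib.request

# Language codes listed on this page.
SUPPORTED_LANGS = {"en", "zh", "ja", "ko", "yue"}

# Assumption: a local GPT-SoVITS API server is listening at this address.
API_URL = "http://127.0.0.1:9880/tts"


def build_tts_payload(text, text_lang, ref_audio_path, prompt_text, prompt_lang):
    """Build the JSON body for one zero-shot cloning request.

    Field names are assumed, not confirmed by this page.
    """
    for lang in (text_lang, prompt_lang):
        if lang not in SUPPORTED_LANGS:
            raise ValueError(f"unsupported language code: {lang!r}")
    if not text.strip():
        raise ValueError("text must not be empty")
    return {
        "text": text,                      # sentence to synthesize
        "text_lang": text_lang,            # language of `text`
        "ref_audio_path": ref_audio_path,  # short reference clip of the target voice
        "prompt_text": prompt_text,        # transcript of the reference clip
        "prompt_lang": prompt_lang,        # language of the reference clip
    }


def synthesize(payload, out_path, url=API_URL):
    """POST the payload and write the returned audio bytes to `out_path`."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
```

Keeping payload construction separate from the network call lets the language check run without a server; `synthesize` only works once a compatible server is actually running.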

Related speech models

hexgrad

Kokoro TTS

Quality 9.2 · Speed 9.8 · 0.33GB · Apache 2.0

Ultra-lightweight yet stunning quality. 82M params only - runs on CPU in real-time. Best quality-to-size ratio of any TTS model.

realtime, streaming, low-latency, multilingual

Kyutai

Kyutai STT 2.6B

Quality 9.4 · Speed 9.5 · 2.7GB · CC-BY-4.0

Production-grade streaming ASR from Kyutai (makers of Moshi). Delay-streaming transformer with 500ms latency, word-level timestamps, speaker diarization. Top of Open ASR Leaderboard for real-time French + English.

streaming, realtime, low-latency, multilingual

Speech Research (SWivid)

F5-TTS v1.1

Quality 9.5 · Speed 9.2 · 1.6GB · MIT

Iterative upgrade over the original F5-TTS. Faster convergence via improved flow-matching schedule, better Chinese prosody, cross-lingual cloning. Now with streaming inference and improved CFM sampler.

realtime, cloning, streaming, multilingual

OpenAI

Whisper v3 Turbo

Quality 9.1 · Speed 9.5 · 1.6GB · MIT

OpenAI's optimized Whisper v3 with 4 decoder layers instead of 32. 8× faster than Whisper Large v3 with only minor accuracy trade-off. 99 languages supported. New gold standard for fast local transcription.

streaming, realtime, multilingual, low-latency

Kyutai

Moshi

Quality 9 · Speed 9.5 · 7.5GB · CC-BY-4.0

Full-duplex spoken dialogue model - listens and speaks simultaneously with ~160 ms latency. Not just a TTS but a real-time conversational speech model. Runs on a single L4 GPU or Mac M3 Pro.

dialogue, streaming, realtime, low-latency, emotion

Neuphonic

NeuTTS Air

Quality 9 · Speed 9.5 · 0.75GB · Apache 2.0

First super-realistic TTS LLM that runs in real-time on CPU. 748M params, LLaMA 3.2 backbone + NeuCodec audio tokenizer. GGUF-native - perfect for on-device agents and offline apps. Instant 3s voice cloning.

cloning, realtime, streaming, low-latency