Local TTS model

StyleTTS 2

Style-based TTS that uses style diffusion to generate natural, expressive speech. An academic research model, rated 9.3/10 for quality on this page.

Quality: 9.3/10
Speed: 6.5/10
Size: 1.8GB
Languages: 3+

Quick answer

StyleTTS 2 is a local speech model from Yinghao Aaron Li et al. It is best suited to controllable synthesis and voice-cloning workflows. Check the license before commercial use.

Model details

Hardware: GPU
Formats: PyTorch
Voices: Style transfer
Latency: medium
License: MIT
Release: 2024-01

Install command

pip install styletts2
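
A quick preflight check can catch a missing dependency before the first synthesis run. The sketch below is an illustration, not part of StyleTTS 2: it assumes the package imports under the same name as the install command (`styletts2`) and that PyTorch is the backend, as the Formats field suggests. The helper names `check_environment` and `cuda_available` are hypothetical.

```python
import importlib.util


def check_environment(modules=("torch", "styletts2")):
    """Report which top-level modules are importable, without importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


def cuda_available():
    """Return True only if torch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so the check works on machines without torch

    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    for module, present in check_environment().items():
        print(f"{module}: {'found' if present else 'missing'}")
    print(f"CUDA GPU available: {cuda_available()}")
```

If `styletts2` is reported missing, run the install command above. If no CUDA device is found, expect slower inference, since the page lists GPU as the target hardware.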

Features

controllable · cloning

Languages: en, zh, ja

Context: Style diffusion

Related TTS models

Speech Research (SWivid)

F5-TTS v1.1

Quality 9.5 · Speed 9.2 · 1.6GB · MIT

Iterative upgrade over the original F5-TTS. Faster convergence via improved flow-matching schedule, better Chinese prosody, cross-lingual cloning. Now with streaming inference and improved CFM sampler.

realtime · cloning · streaming · multilingual

Neuphonic

NeuTTS Air

Quality 9 · Speed 9.5 · 0.75GB · Apache 2.0

First super-realistic TTS LLM that runs in real-time on CPU. 748M params, LLaMA 3.2 backbone + NeuCodec audio tokenizer. GGUF-native - perfect for on-device agents and offline apps. Instant 3s voice cloning.

cloning · realtime · streaming · low-latency

Speech Research

F5-TTS

Quality 9.4 · Speed 9 · 1.5GB · MIT

Flow-matching based TTS with SOTA quality and extremely fast inference. Simple and efficient architecture.

realtime · cloning · streaming

Amphion Team

MaskGCT

Quality 9.4 · Speed 9 · 2.8GB · MIT

Fully non-autoregressive TTS - no text-phone alignment needed. Achieves human parity on naturalness and similarity metrics. Incredibly fast inference.

cloning · realtime · streaming

Alibaba FunAudioLLM

CosyVoice 2

Quality 9.3 · Speed 8.8 · 2.4GB · Apache 2.0

Industrial-grade multilingual TTS with streaming, voice cloning and emotion control. Exceptional Chinese + English quality. Used in production at Alibaba scale.

streaming · realtime · cloning · emotion · multilingual

MyShell

MeloTTS

Quality 9 · Speed 9 · 1.5GB · MIT

High-quality multilingual TTS with extremely natural voice cloning. Best for Chinese and English with fast inference.

cloning · realtime · multilingual
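
To compare the entries on this page programmatically, the listed figures can be put into a small table and filtered. This is a minimal sketch that uses only the scores, sizes, and licenses shown above. The `Model` type and `shortlist` function are hypothetical names introduced for illustration, and the ratings are this page's own, not independent benchmarks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    quality: float  # page rating out of 10
    speed: float    # page rating out of 10
    size_gb: float
    license: str


# Figures copied from the cards on this page.
MODELS = [
    Model("StyleTTS 2", 9.3, 6.5, 1.8, "MIT"),
    Model("F5-TTS v1.1", 9.5, 9.2, 1.6, "MIT"),
    Model("NeuTTS Air", 9.0, 9.5, 0.75, "Apache 2.0"),
    Model("F5-TTS", 9.4, 9.0, 1.5, "MIT"),
    Model("MaskGCT", 9.4, 9.0, 2.8, "MIT"),
    Model("CosyVoice 2", 9.3, 8.8, 2.4, "Apache 2.0"),
    Model("MeloTTS", 9.0, 9.0, 1.5, "MIT"),
]


def shortlist(models, max_size_gb=None, license=None):
    """Filter by size and license, then rank by quality, breaking ties by speed."""
    picked = [
        m for m in models
        if (max_size_gb is None or m.size_gb <= max_size_gb)
        and (license is None or m.license == license)
    ]
    return sorted(picked, key=lambda m: (-m.quality, -m.speed))


if __name__ == "__main__":
    for m in shortlist(MODELS, max_size_gb=2.0, license="MIT"):
        print(f"{m.name}: quality {m.quality}, speed {m.speed}, {m.size_gb}GB")
```

Ranking by quality first reflects where StyleTTS 2 is strongest on this page. Swapping the sort key to speed would favour the realtime models instead.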