Best local TTS for voice cloning

A practical shortlist of local TTS and speech models for private voice cloning, expressive generation and offline voice pipelines.

Quick answer

For voice cloning, start with models that explicitly support cloning or expressive speaker control. Before production use, test pronunciation on your own text, confirm that you have the speaker's consent, and check each model's license terms.
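One way to make the pre-production checks concrete is a small gate that combines consent, license status and a pronunciation score. The sketch below is only an illustration: `CloneCheck`, `ready_for_production` and the 10% threshold are hypothetical names and values, not part of any model's API. It assumes you have already transcribed the synthesized audio with an ASR tool of your choice, and it uses word error rate (WER) between the prompt text and that transcript as a rough proxy for pronunciation accuracy.

```python
from dataclasses import dataclass


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference text must not be empty")
    # prev[j] = edits needed to turn the ref words seen so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


@dataclass
class CloneCheck:
    """Hypothetical record of one cloning test (all field names are assumptions)."""
    model: str
    speaker_consent_on_file: bool
    license_allows_use: bool
    prompt_text: str
    asr_transcript: str  # transcript of the synthesized audio, from any ASR tool


def ready_for_production(check: CloneCheck, max_wer: float = 0.1) -> bool:
    """Pass only if consent and license are cleared and WER is within the limit."""
    wer = word_error_rate(check.prompt_text, check.asr_transcript)
    return (
        check.speaker_consent_on_file
        and check.license_allows_use
        and wer <= max_wer
    )
```

WER only catches words the ASR hears differently, so it complements rather than replaces listening tests for names, numbers and domain terms.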

Recommended starting points

#1

F5-TTS v1.1

Speech Research (SWivid) · quality 9.5/10 · speed 9.2/10

An iterative upgrade over the original F5-TTS, with faster convergence from an improved flow-matching schedule, better Chinese prosody and cross-lingual cloning. It adds streaming inference and an improved conditional flow matching (CFM) sampler.

realtime · cloning · streaming · multilingual

#2

F5-TTS

Speech Research (SWivid) · quality 9.4/10 · speed 9/10

Flow-matching-based TTS with state-of-the-art quality, fast inference and a simple, efficient architecture.

realtime · cloning · streaming

#3

MaskGCT

Amphion Team · quality 9.4/10 · speed 9/10

Fully non-autoregressive TTS that needs no explicit text-to-speech alignment or phone-level duration prediction. The authors report human-level parity on naturalness and speaker-similarity metrics, with fast inference.

cloning · realtime · streaming

#4

Zonos v0.1

Zyphra · quality 9.5/10 · speed 8.5/10

1.6B-parameter open-weight TTS with realistic zero-shot cloning from 5 to 30 s of reference audio. Fine-grained controls cover speaking rate, pitch and emotion (happiness, sadness, anger, fear). Supports streaming with roughly 200 ms first-token latency.

cloning · emotion · streaming · realtime · controllable

#5

NeuTTS Air

Neuphonic · quality 9/10 · speed 9.5/10

Billed by Neuphonic as the first super-realistic speech language model that runs in real time on CPU. 748M parameters, with a Qwen 0.5B backbone and the NeuCodec audio codec. Ships in GGUF format, which suits on-device agents and offline apps. Clones a voice from about 3 s of reference audio.

cloning · realtime · streaming · low-latency

#6

CosyVoice 2

Alibaba FunAudioLLM · quality 9.3/10 · speed 8.8/10

Industrial-grade multilingual TTS with streaming, voice cloning and emotion control. Exceptional Chinese + English quality. Used in production at Alibaba scale.

streaming · realtime · cloning · emotion · multilingual

#7

VoxCPM2

OpenBMB · quality 9.4/10 · speed 8.3/10

Tokenizer-free diffusion-autoregressive TTS with 2B parameters, 30 languages and 48 kHz output. Supports voice design, controllable cloning and real-time streaming. Released under Apache 2.0, which permits commercial use.

cloning · streaming · realtime · multilingual · controllable · emotion

#8

MeloTTS

MyShell · quality 9/10 · speed 9/10

High-quality multilingual TTS with fast inference, strongest in Chinese and English. MeloTTS does not clone voices on its own; cloning is usually done by pairing it with MyShell's OpenVoice tone converter.

cloning · realtime · multilingual

#9

Chatterbox TTS

Resemble AI · quality 9.4/10 · speed 8/10

Open-source voice cloning model from Resemble AI, which reports that listeners preferred it over ElevenLabs in its own blind evaluations. Supports emotion exaggeration control and is designed for stable generation.

cloning · emotion · streaming
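The list order reflects one particular balance of quality and speed. If your pipeline weights them differently (for example, an interactive agent that cares mostly about latency), you can re-rank the same scores yourself. The sketch below copies the quality and speed scores from the entries above; the `rank` helper and its `quality_weight` parameter are illustrative assumptions, not LocalClaw's actual scoring method.

```python
# Scores copied from the shortlist entries: (model, quality, speed), each out of 10.
SHORTLIST = [
    ("F5-TTS v1.1", 9.5, 9.2),
    ("F5-TTS", 9.4, 9.0),
    ("MaskGCT", 9.4, 9.0),
    ("Zonos v0.1", 9.5, 8.5),
    ("NeuTTS Air", 9.0, 9.5),
    ("CosyVoice 2", 9.3, 8.8),
    ("VoxCPM2", 9.4, 8.3),
    ("MeloTTS", 9.0, 9.0),
    ("Chatterbox TTS", 9.4, 8.0),
]


def rank(shortlist, quality_weight=0.5):
    """Sort models by a weighted blend of quality and speed, best first.

    quality_weight is a hypothetical knob: 1.0 ranks on quality alone,
    0.0 ranks on speed alone.
    """
    if not 0.0 <= quality_weight <= 1.0:
        raise ValueError("quality_weight must be between 0 and 1")
    speed_weight = 1.0 - quality_weight

    def blended(row):
        _, quality, speed = row
        return quality_weight * quality + speed_weight * speed

    # sorted() is stable, so models with equal scores keep their list order.
    return sorted(shortlist, key=blended, reverse=True)


if __name__ == "__main__":
    for name, quality, speed in rank(SHORTLIST, quality_weight=0.3):
        print(f"{name}: quality {quality}, speed {speed}")
```

A blended score is only a first filter; the scores say nothing about license terms or how a model handles your language and speakers.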

Source checks

These guides take their scores from LocalClaw's internal model database. Beyond those scores, they avoid hard claims that go further than the public hardware and model-availability signals checked before publishing.