Best local TTS for voice cloning

A practical shortlist of local TTS and speech models for private voice cloning, expressive generation and offline voice pipelines.

Quick answer

For voice cloning, start with models that explicitly support cloning or expressive speaker control, then test pronunciation, consent requirements and license constraints before production use.
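One way to apply this first filtering step is to encode each candidate's tags and scores, then keep only the models that carry every capability you need. The snippet below is a minimal sketch, not a LocalClaw tool: the `Candidate` class, the `pick` helper and the `quality_weight` parameter are hypothetical names, and the four entries are transcribed from the cards on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float  # page score out of 10
    speed: float    # page score out of 10
    tags: frozenset


# Subset of the shortlist, copied from the cards on this page.
SHORTLIST = [
    Candidate("F5-TTS v1.1", 9.5, 9.2,
              frozenset({"realtime", "cloning", "streaming", "multilingual"})),
    Candidate("Zonos v0.1", 9.5, 8.5,
              frozenset({"cloning", "emotion", "streaming", "realtime", "controllable"})),
    Candidate("NeuTTS Air", 9.0, 9.5,
              frozenset({"cloning", "realtime", "streaming", "low-latency"})),
    Candidate("Chatterbox TTS", 9.4, 8.0,
              frozenset({"cloning", "emotion", "streaming"})),
]


def pick(candidates, required_tags, quality_weight=0.5):
    """Keep candidates that have every required tag, best blended score first.

    quality_weight is an illustrative knob (assumption): 1.0 ranks by quality
    only, 0.0 ranks by speed only.
    """
    required = frozenset(required_tags)
    matches = [c for c in candidates if required <= c.tags]

    def blended(c):
        return quality_weight * c.quality + (1 - quality_weight) * c.speed

    return sorted(matches, key=blended, reverse=True)


if __name__ == "__main__":
    for candidate in pick(SHORTLIST, {"cloning", "emotion"}):
        print(candidate.name)
```

Filtering on tags only narrows the list. Pronunciation, consent and license checks still have to be done by hand for each surviving model.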

Recommended starting points

F5-TTS v1.1

Speech Research (SWivid) · quality 9.5/10 · speed 9.2/10

Iterative upgrade over the original F5-TTS. Faster convergence via improved flow-matching schedule, better Chinese prosody, cross-lingual cloning. Now with streaming inference and improved CFM sampler.

realtime · cloning · streaming · multilingual

F5-TTS

Speech Research · quality 9.4/10 · speed 9/10

Flow-matching based TTS with SOTA quality and extremely fast inference. Simple and efficient architecture.

realtime · cloning · streaming

MaskGCT

Amphion Team · quality 9.4/10 · speed 9/10

Fully non-autoregressive TTS that needs no explicit text-to-phone alignment. Its authors report human parity on naturalness and similarity metrics. Fast inference.

cloning · realtime · streaming

Zonos v0.1

Zyphra · quality 9.5/10 · speed 8.5/10

1.6B open-weight TTS with ultra-realistic zero-shot cloning from 5-30 s audio. Fine-grained controls: speaking rate, pitch, emotion (happy/sad/angry/fear). Streaming with ~200 ms first-token latency.

cloning · emotion · streaming · realtime · controllable

NeuTTS Air

Neuphonic · quality 9/10 · speed 9.5/10

Presented by Neuphonic as the first super-realistic TTS LLM that runs in real time on CPU. 748M params, LLaMA 3.2 backbone + NeuCodec audio tokenizer. GGUF-native, so it suits on-device agents and offline apps. Instant voice cloning from 3 s of audio.

cloning · realtime · streaming · low-latency

CosyVoice 2

Alibaba FunAudioLLM · quality 9.3/10 · speed 8.8/10

Industrial-grade multilingual TTS with streaming, voice cloning and emotion control. Exceptional Chinese + English quality. Used in production at Alibaba scale.

streaming · realtime · cloning · emotion · multilingual

VoxCPM2

OpenBMB · quality 9.4/10 · speed 8.3/10

Tokenizer-free diffusion autoregressive TTS with 2B parameters, 30 languages, 48 kHz output, voice design, controllable cloning and real-time streaming. Apache 2.0 and commercial-ready.

cloning · streaming · realtime · multilingual · controllable · emotion

MeloTTS

MyShell · quality 9/10 · speed 9/10

High-quality multilingual TTS with extremely natural voice cloning. Best for Chinese and English with fast inference.

cloning · realtime · multilingual

Chatterbox TTS

Resemble AI · quality 9.4/10 · speed 8/10

Open-source voice cloning model from Resemble AI, which reports that it outperforms ElevenLabs on naturalness benchmarks. Supports emotion exaggeration control and stable generation.

cloning · emotion · streaming
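Several cards above quote a reference-clip length for cloning (5-30 s for Zonos v0.1, about 3 s for NeuTTS Air). Before testing a model, it can help to confirm that a reference recording falls inside the quoted window. The following is a minimal sketch using only Python's standard `wave` module. It assumes uncompressed WAV input, and `clip_seconds`, `check_reference` and `write_silence` are hypothetical helper names, not part of any model's API.

```python
import wave


def clip_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference(path, min_s=5.0, max_s=30.0):
    """Report whether a clip fits a model's quoted reference window.

    The defaults mirror the 5-30 s range quoted for Zonos v0.1 on this page;
    pass other bounds for other models.
    """
    seconds = clip_seconds(path)
    return min_s <= seconds <= max_s, seconds


def write_silence(path, seconds, rate=16000):
    """Write a silent mono 16-bit WAV, used here only to exercise the check."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))


if __name__ == "__main__":
    write_silence("reference_demo.wav", 6.0)
    ok, seconds = check_reference("reference_demo.wav")
    print(f"{seconds:.1f} s, within range: {ok}")
```

A length check says nothing about recording quality or consent. It only catches clips that are too short or too long for the model being tested.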

Source checks

Scores in these guides come from LocalClaw's internal model database. Beyond those scores, the guides avoid hard claims and rely only on public hardware and model availability signals checked before publishing.

MisoTTS GitHub repository
WavTTS project page