Local TTS model
VoxCPM2
Tokenizer-free diffusion autoregressive TTS with 2B parameters, 30 languages, 48kHz output, voice design, controllable cloning and real-time streaming. Apache 2.0 and commercial-ready.
GPU recommended
text-to-speech generation
30 languages
Apache 2.0
Quality: 9.4/10
Speed: 8.3/10
Model size: 4.2 GB
Voices: Voice design + short-clip controllable cloning
Can VoxCPM2 run locally?
VoxCPM2 can generate speech locally for private voice workflows. Start with pip install voxcpm.
Apache 2.0 license. Still verify upstream usage notes before shipping.
pip install voxcpm
cloning, streaming, realtime, multilingual, controllable, emotion
Audio profile
Best fit
VoxCPM2 is best for local voice cloning and expressive speech generation.
Hardware: GPU, Apple Silicon
Model details
Type: Local TTS model
Family: voxcpm
Latency: low
Formats: pytorch, safetensors
Languages: zh, en, ar, my, da, nl, fi, fr, de, el, he, hi, id, it, ja, km, ko, lo, ms, no, pl, pt, ru, es, sw, sv, tl, th, tr, vi
Context: 2B params, 30 languages, 48kHz output, streaming RTF as low as ~0.3 on RTX 4090
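The streaming figure above is a real-time factor (RTF). RTF is commonly defined as synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below is only that arithmetic. The function name `real_time_factor` and the 10-second example are illustrative assumptions, not part of voxcpm or a measured result.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative numbers only: at an RTF of 0.3, 10 s of speech takes about 3 s to generate.
rtf = real_time_factor(synthesis_seconds=3.0, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}")
```

Measured on your own hardware, the same ratio shows how far a given machine is from the quoted RTX 4090 figure.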
Install locally
01 Check runtime: Confirm the backend supports pytorch, safetensors on your machine.
02 Install model: Use the upstream command or repository instructions.
03 Test locally: Run a short private audio prompt before moving into production workflows.
pip install voxcpm
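The runtime check in step 01 can be sketched in a few lines of Python. This is a minimal sketch, not an official installer check. It assumes the import names are `torch`, `safetensors` and `voxcpm` (the last one is assumed to match the pip package name), and the helper names `missing_modules` and `describe_accelerator` are hypothetical.

```python
import importlib.util

# Import names to look for. "torch" and "safetensors" are the usual import names
# for the two listed formats; "voxcpm" is assumed to match the pip package name.
REQUIRED_MODULES = ("torch", "safetensors", "voxcpm")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def describe_accelerator():
    """Report which PyTorch device is usable, or note that torch is absent."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch

    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # Apple Silicon backend, if this build has it
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print("missing:", missing_modules() or "none")
    print("accelerator:", describe_accelerator())
```

An empty "missing" list and an accelerator of `cuda` or `mps` suggest the machine matches the GPU and Apple Silicon hardware this page lists. It does not confirm that the model itself loads.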
Good for
- text-to-speech generation
- local workflows (GPU recommended)
- cloning, streaming, realtime
Watch before shipping
- Validate pronunciation, latency and artifacts with your own voice samples.
- Review the upstream license and acceptable-use notes.
- Benchmark on your target CPU, Apple Silicon or GPU setup.
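For the latency and benchmarking checks above, a small timing harness is usually enough to start. The sketch below assumes your backend can hand back audio as an iterable of byte chunks. `fake_stream` is a hypothetical stand-in that yields silence, so the script runs without any model. Replace it with your real streaming call. Neither function is part of voxcpm.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Consume an audio chunk stream and time it.

    Returns (seconds until first chunk, total seconds, total bytes received).
    """
    start = time.perf_counter()
    first_chunk_at = None
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        total_bytes += len(chunk)
    total = time.perf_counter() - start
    if first_chunk_at is None:
        raise ValueError("stream produced no chunks")
    return first_chunk_at, total, total_bytes


def fake_stream(n_chunks: int = 5, delay: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call; yields silent PCM chunks."""
    for _ in range(n_chunks):
        time.sleep(delay)
        yield b"\x00\x00" * 480  # 480 16-bit samples of silence


if __name__ == "__main__":
    first, total, size = measure_stream(fake_stream())
    print(f"first chunk after {first * 1000:.1f} ms, total {total * 1000:.1f} ms, {size} bytes")
```

Time to first chunk is the number listeners notice in interactive use, so record it separately from total generation time on each target device.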
Related TTS and speech models
Zonos v0.1 (Zyphra): Local TTS model · Q 9.5 · Speed 8.5
CosyVoice 2 (Alibaba FunAudioLLM): Local TTS model · Q 9.3 · Speed 8.8
F5-TTS v1.1 (Speech Research (SWivid)): Local TTS model · Q 9.5 · Speed 9.2
Qwen3 TTS (Alibaba Cloud (Qwen Team)): Local TTS model · Q 9.5 · Speed 8.5
IndexTTS 2 (Bilibili): Local TTS model · Q 9.4 · Speed 8
OpenVoice V2 (MyShell): Local TTS model · Q 8.9 · Speed 9
MOSS-TTS-Nano (OpenMOSS / MOSI.AI): Local TTS model · Q 8.5 · Speed 9.7
Fish Speech (Fish Audio): Local TTS model · Q 9 · Speed 8.5