Local TTS model

MisoTTS

MisoTTS is an 8B-parameter, English-first, emotive conversational TTS model designed for natural dialogue, voice continuation from prompt audio, and private local speech experiments. It scores high on quality, but it is heavier than small TTS models and runs best on CUDA GPUs or larger Apple Silicon setups.

GPU recommended, text-to-speech generation, 1 language, Other / custom license
Quality: 9.4/10
Speed: 5.8/10
Model size: 16 GB
Voices: Emotive English dialogue + prompt audio continuation
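
The 16 GB model size is consistent with 8B parameters stored at 16 bits each. The sketch below works through that arithmetic. The bytes-per-parameter values are assumptions about how the weights are stored, not something this listing states, and the result covers weights only (activations and caches need additional memory).

```python
# Rough weight-memory estimate for an 8B-parameter model.
# Assumption: weights are stored as 16-bit floats (2 bytes per parameter).
BYTES_PER_PARAM = {"float32": 4, "float16/bfloat16": 2}


def weight_size_gb(num_params: float, bytes_per_param: int) -> float:
    """Return the approximate size of the weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    params = 8e9  # "8B" from the listing
    for dtype, nbytes in BYTES_PER_PARAM.items():
        print(f"{dtype}: about {weight_size_gb(params, nbytes):.0f} GB of weights")
```

If your GPU or unified memory is close to the weight size, expect to need headroom beyond this figure before generation fits comfortably.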

Can MisoTTS run locally?

MisoTTS can generate speech locally for private voice workflows. Start by cloning the upstream repository: git clone https://github.com/MisoLabsAI/MisoTTS

Other / custom license. Review upstream restrictions before commercial use.

emotion, dialogue, cloning, controllable

Audio profile

Quality: 9.4
Speed: 5.8
Local: 7.5

Best fit

MisoTTS is best for local voice cloning and expressive speech generation.

Hardware: gpu, apple

Model details

Type: Local TTS model
Family: miso
Latency: medium
Formats: pytorch, safetensors
Languages: en
Context: 8B text-to-dialogue RVQ Transformer, Mimi audio tokenizer, 2048 max sequence length
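
To get a feel for what a 2048-position limit could mean in seconds of audio, here is a back-of-the-envelope sketch. It rests on two assumptions that this listing does not confirm: that the Mimi tokenizer runs at 12.5 frames per second, and that each audio frame occupies one sequence position shared with the text tokens and any prompt audio. Treat the output as an order-of-magnitude guide and check the upstream repository for the real accounting.

```python
import math

MAX_SEQ_LEN = 2048         # from the listing
MIMI_FRAME_RATE_HZ = 12.5  # assumption: Mimi frame rate, one position per frame


def audio_budget_seconds(
    text_tokens: int,
    prompt_audio_seconds: float = 0.0,
    max_seq_len: int = MAX_SEQ_LEN,
    frame_rate_hz: float = MIMI_FRAME_RATE_HZ,
) -> float:
    """Estimate how many seconds of new audio fit in the remaining positions."""
    prompt_frames = math.ceil(prompt_audio_seconds * frame_rate_hz)
    remaining = max_seq_len - text_tokens - prompt_frames
    return max(remaining, 0) / frame_rate_hz


if __name__ == "__main__":
    # Hypothetical request: 48 text tokens plus a 10-second voice prompt.
    print(f"{audio_budget_seconds(48, 10.0):.1f} s of generated audio")
```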

Install locally

01 Check runtime: Confirm the backend supports pytorch and safetensors on your machine.
02 Install model: Use the upstream command or repository instructions.
03 Test locally: Run a short private audio prompt before moving into production workflows.

git clone https://github.com/MisoLabsAI/MisoTTS
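
The runtime check in step 01 can be sketched in a few lines of Python. This is a minimal example, not part of the MisoTTS repository: the helper names `has_package` and `pick_device` are hypothetical, and the script only reports whether the `torch` and `safetensors` packages are importable and which PyTorch device is available. It does not load the model.

```python
import importlib.util


def has_package(name: str) -> bool:
    """Return True if a top-level package can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu', or 'unavailable' if PyTorch is missing."""
    if not has_package("torch"):
        return "unavailable"
    import torch

    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds may not expose the MPS backend at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    for pkg in ("torch", "safetensors"):
        print(f"{pkg}: {'installed' if has_package(pkg) else 'missing'}")
    print("device:", pick_device())
```

A result of "cpu" or "unavailable" suggests sorting out the environment before downloading the weights, given the GPU recommendation above.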

Good for

  • text-to-speech generation
  • local workflows where a GPU is recommended
  • emotion, dialogue, cloning

Watch before shipping

  • Validate pronunciation, latency and artifacts with your own voice samples.
  • Review the upstream license and acceptable-use notes.
  • Benchmark on your target CPU, Apple Silicon or GPU setup.
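
One way to make the latency and benchmarking checks concrete is to measure the real-time factor (synthesis time divided by audio duration) on your own hardware. The sketch below assumes a callable that takes text and returns a sequence of samples plus a sample rate. `fake_synthesize` is a hypothetical stand-in, not the MisoTTS API, so replace it with whatever entry point the upstream repository provides.

```python
import time
from statistics import median
from typing import Callable, Sequence, Tuple

Synthesizer = Callable[[str], Tuple[Sequence[float], int]]


def real_time_factor(synthesize: Synthesizer, text: str, runs: int = 3) -> float:
    """Median of (wall-clock synthesis time / seconds of audio produced)."""
    factors = []
    for _ in range(runs):
        start = time.perf_counter()
        samples, sample_rate = synthesize(text)
        elapsed = time.perf_counter() - start
        audio_seconds = len(samples) / sample_rate
        factors.append(elapsed / audio_seconds)
    return median(factors)


def fake_synthesize(text: str) -> Tuple[Sequence[float], int]:
    """Hypothetical stand-in: one second of silence at 24 kHz after a short delay."""
    time.sleep(0.01)
    return [0.0] * 24000, 24000


if __name__ == "__main__":
    rtf = real_time_factor(fake_synthesize, "Hello there.")
    print(f"real-time factor: {rtf:.3f} (below 1.0 is faster than real time)")
```

Running the same measurement with your own prompts on each target machine gives comparable numbers across CPU, Apple Silicon, and GPU setups.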
