Local TTS model

Bark (Suno)

Generative TTS that can produce highly expressive speech, music, and sound effects. Very creative, but slow to generate.

Quality 8.5/10 · Speed 4/10 · 4.2GB · 13+ languages

Quick answer

Bark (Suno) is a local speech model from Suno. It is best suited for multilingual speech, music, and sound-effect workflows. Check the license before commercial use.

Model details

Hardware: GPU, Apple

Formats: PyTorch

Voices: Preset voices + prompts

Latency: high

License: MIT

Release: 2023-04

Install command

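pip install git+https://github.com/suno-ai/bark.git

Note: plain "pip install bark" installs a different, unrelated PyPI package that is not maintained by Suno.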
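Once the package is installed, a first generation can be sketched roughly as follows. This is a minimal sketch, assuming Suno's `bark` package (installed from its GitHub repository) and `scipy` are available. The helper names `speaker_preset` and `synthesize` are hypothetical, the `v2/<lang>_speaker_<n>` preset naming and the 0..9 index range are assumptions based on the project's published presets, and the first call downloads several gigabytes of weights.

```python
import os

# Language codes as listed on this page; used only to validate preset names.
LANGS = {"en", "zh", "fr", "de", "hi", "it", "ja", "ko", "pl", "pt", "ru", "es", "tr"}


def speaker_preset(lang: str = "en", index: int = 0) -> str:
    """Build a preset name such as 'v2/en_speaker_6' (naming scheme assumed)."""
    if lang not in LANGS:
        raise ValueError(f"unsupported language code: {lang!r}")
    if not 0 <= index <= 9:
        raise ValueError("speaker index is assumed to be in 0..9")
    return f"v2/{lang}_speaker_{index}"


def synthesize(text: str, out_path: str, lang: str = "en", index: int = 0,
               small: bool = False) -> None:
    """Generate audio for `text` with Bark and write it to a WAV file."""
    if small:
        # Only takes effect if set before bark is imported for the first time.
        os.environ["SUNO_USE_SMALL_MODELS"] = "True"

    # Deferred imports: bark pulls in PyTorch and fetches weights on first use.
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()
    audio = generate_audio(text, history_prompt=speaker_preset(lang, index))
    write_wav(out_path, SAMPLE_RATE, audio)


# Example call (slow on CPU; a GPU is strongly recommended):
# synthesize("Hello, this is a short test.", "bark_out.wav", lang="en", index=6)
```

Setting `small=True` trades some quality for lower memory use, which may help on machines that cannot hold the full 4.2GB checkpoint.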

Features

multilingual, music, sfx

Languages: en, zh, fr, de, hi, it, ja, ko, pl, pt, ru, es, tr

Context: Prompt engineering
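Because Bark is steered through the text prompt rather than through separate parameters, prompt construction can be illustrated with a few small helpers. This is a sketch only: the function names are hypothetical, and the bracketed cues, the ♪ markers for sung text, and capitalization for emphasis are conventions described in the project's README that the model treats as hints, not guaranteed commands.

```python
# Non-speech cues reported in the Bark README; the model may ignore them.
CUES = ("[laughter]", "[laughs]", "[sighs]", "[music]", "[gasps]", "[clears throat]")


def add_cue(text: str, cue: str, at_start: bool = False) -> str:
    """Attach a bracketed non-speech cue before or after the text."""
    if cue not in CUES:
        raise ValueError(f"unknown cue: {cue!r}")
    text = text.strip()
    return f"{cue} {text}" if at_start else f"{text} {cue}"


def as_song(lyrics: str) -> str:
    """Wrap lyrics in music notes to nudge the model toward singing."""
    return f"♪ {lyrics.strip()} ♪"


def emphasize(text: str, word: str) -> str:
    """Upper-case every occurrence of `word` to suggest vocal emphasis."""
    return text.replace(word, word.upper())


prompt = add_cue(emphasize("this is big news", "big"), "[laughs]")
print(prompt)
```

The resulting string is passed to the model as the text prompt. Results vary between runs, so it is worth generating a prompt several times before judging it.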

Related speech models

hexgrad

Kokoro TTS

Quality 9.2 · Speed 9.8 · 0.33GB · Apache 2.0

Ultra-lightweight yet stunning quality. Only 82M parameters; runs on CPU in real time. Best quality-to-size ratio of any TTS model.

realtime, streaming, low-latency, multilingual

Kyutai

Kyutai STT 2.6B

Quality 9.4 · Speed 9.5 · 2.7GB · CC-BY-4.0

Production-grade streaming ASR from Kyutai (makers of Moshi). Delay-streaming transformer with 500ms latency, word-level timestamps, speaker diarization. Top of Open ASR Leaderboard for real-time French + English.

streaming, realtime, low-latency, multilingual

Speech Research (SWivid)

F5-TTS v1.1

Quality 9.5 · Speed 9.2 · 1.6GB · MIT

Iterative upgrade over the original F5-TTS. Faster convergence via improved flow-matching schedule, better Chinese prosody, cross-lingual cloning. Now with streaming inference and improved CFM sampler.

realtime, cloning, streaming, multilingual

OpenAI

Whisper v3 Turbo

Quality 9.1 · Speed 9.5 · 1.6GB · MIT

OpenAI's optimized Whisper v3 with 4 decoder layers instead of 32. 8× faster than Whisper Large v3 with only minor accuracy trade-off. 99 languages supported. New gold standard for fast local transcription.

streaming, realtime, multilingual, low-latency

NVIDIA

Canary 1B v2

Quality 9.3 · Speed 9 · 2GB · CC-BY-4.0

NVIDIA multilingual ASR + speech translation in a single model. 25 European languages, bidirectional EN↔XX translation. Tops Open ASR Leaderboard multilingual category. Word-level timestamps, punctuation & capitalization.

streaming, multilingual, realtime

Alibaba FunAudioLLM

CosyVoice 2

Quality 9.3 · Speed 8.8 · 2.4GB · Apache 2.0

Industrial-grade multilingual TTS with streaming, voice cloning and emotion control. Exceptional Chinese + English quality. Used in production at Alibaba scale.

streaming, realtime, cloning, emotion, multilingual