Catalogue

Chatterbox

State-of-the-art 0.5B-parameter zero-shot text-to-speech model for expressive dialogue.

ChatTTS

Conversational text-to-speech for dialogue scenarios, in English and Chinese.

Conversational Speech Model (Sesame)

Multimodal model generating human-like conversational speech with autoregressive transformers.

Dia 1.6B

Expressive text-to-speech with emotion and tone control plus nonverbal sounds.

NVIDIA Parakeet v2

High-quality English speech recognition with punctuation and word-level timestamping.

Parler-TTS

Lightweight text-to-speech that generates natural speech in a given speaker style.

Pipecat

Open-source Python framework for building real-time voice and multimodal conversational agents.

Qwen2.5-Omni

Vision-language-audio model with speech input and output plus document understanding.

Speaker Diarization 3.1

Speaker diarization pipeline that identifies and segments speakers in audio and outputs the result as diarization annotations.

Ultravox

Multimodal model for real-time voice interaction, consuming both speech and text inputs.

Voice Lab

Framework for testing and evaluating voice agents across models, prompts, and personas.

Whisper

General-purpose speech recognition trained on a large dataset of diverse audio.