The Conversational Speech Model (CSM) from Sesame is a multimodal model that uses autoregressive transformers for speech generation. It supports speech-to-speech generation, producing human-like conversational speech.
Reach for it when an agent needs to generate natural, human-like spoken responses in a conversational setting rather than flat synthesized audio.
