Ultravox is a multimodal model built on Llama 3.1 70B that consumes both speech and text inputs in a single pipeline. It supports speech-to-text and speech-to-speech for real-time voice interaction.
Reach for it when an agent needs to hold a real-time voice conversation, taking spoken input and responding in speech, without stitching together separate components.