Qwen-2.5-Omni is a vision-language-audio model with speech input and output. It handles chart, document, and image understanding, and it supports speech-to-text, text-to-speech, and speech-to-speech.
Reach for it when an agent needs to combine speech, vision, and document handling in one model, for example to reason over an image or document and reply in speech.
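
The selection rule above can be sketched as a small routing check. This is a minimal illustration only: the `Task` type, the modality labels, the `wants_omni_model` and `pick_model` helpers, and the fallback model name are all hypothetical, and the rule itself (speech combined with vision or document handling) is one assumed reading of "combine speech, vision, and document handling". It makes no call to any real model API.

```python
from dataclasses import dataclass

# Hypothetical modality labels for this sketch; not part of any real API.
SPEECH, VISION, DOCUMENT, TEXT = "speech", "vision", "document", "text"


@dataclass(frozen=True)
class Task:
    """What one agent step consumes and what it must produce (illustrative)."""
    inputs: frozenset
    output: str


def wants_omni_model(task: Task) -> bool:
    """True when the step mixes speech with vision or document handling.

    Assumed rule: speech must appear somewhere (input or output), and so
    must at least one of vision or document.
    """
    modalities = set(task.inputs) | {task.output}
    uses_speech = SPEECH in modalities
    uses_visual = bool(modalities & {VISION, DOCUMENT})
    return uses_speech and uses_visual


def pick_model(task: Task) -> str:
    """Route to the omni model only when the modalities call for it."""
    # "single-modality-model" is a placeholder name, not a real model.
    return "Qwen-2.5-Omni" if wants_omni_model(task) else "single-modality-model"


if __name__ == "__main__":
    # The example from the text: reason over a document, reply in speech.
    doc_to_speech = Task(inputs=frozenset({DOCUMENT}), output=SPEECH)
    print(pick_model(doc_to_speech))
```

A text-only step, or a step that only transcribes speech to text, falls through to the placeholder under this rule, since neither combines speech with vision or document input.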