Phi-4 Multimodal processes text, image, and audio inputs in a single model with a 128K token context length. It supports OCR along with chart and table understanding for multimodal document tasks.
Reach for Phi-4 when you need lightweight multimodal with speech: it handles text, vision, and speech in a compact model, making it a great fit for on-device agents.