Phi-4 Multimodal

A compact multimodal model that processes text, image, and audio inputs with a 128K-token context window and supports OCR as well as chart and table understanding.

01 / About

About Phi-4 Multimodal.

Phi-4 Multimodal processes text, image, and audio inputs in a single model with a 128K token context length. It supports OCR along with chart and table understanding for multimodal document tasks.

Reach for Phi-4 Multimodal when you need a lightweight multimodal model with speech support: it handles text, vision, and speech in a single compact model, which makes it a good fit for on-device agents.

04 / Build

Build with Phi-4 Multimodal.

Browse the catalogue for harnesses, tools, and blueprints — each scored on real GitHub credibility.
