Phi-4 Multimodal

A compact multimodal model that processes text, image, and audio inputs with a 128K-token context and supports OCR as well as chart and table understanding.

01 / About

About Phi-4 Multimodal.

Phi-4 Multimodal processes text, image, and audio inputs in a single model with a 128K token context length. It supports OCR along with chart and table understanding for multimodal document tasks.

Reach for Phi-4 Multimodal when you need a lightweight multimodal model with speech support: it handles text, vision, and speech in one compact model, which makes it a good fit for on-device agents.
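When planning a request that mixes modalities, it can help to check the combined input against the 128K-token context before sending it. The sketch below is a minimal illustration, not part of any Phi-4 Multimodal API: the function name `fits_in_context`, the per-modality token counts, and the output reserve are hypothetical, and it assumes "128K" means 128 × 1024 tokens, which this listing does not confirm.

```python
# Assumption: "128K" is read as 128 * 1024 tokens. The listing does not
# say which convention applies, so adjust this constant if needed.
CONTEXT_LIMIT = 128 * 1024


def fits_in_context(text_tokens, image_tokens=0, audio_tokens=0,
                    reserve_for_output=1024, limit=CONTEXT_LIMIT):
    """Return True if the combined inputs plus an output reserve fit the limit.

    All counts are hypothetical inputs supplied by the caller; how a real
    tokenizer or processor counts image and audio tokens is not covered here.
    """
    counts = (text_tokens, image_tokens, audio_tokens, reserve_for_output)
    if any(c < 0 for c in counts):
        raise ValueError("token counts must be non-negative")
    return sum(counts) <= limit
```

A caller would pass token counts obtained from whatever tokenizer or processor it uses, for example `fits_in_context(text_tokens=1000, image_tokens=2000, audio_tokens=3000)`, and trim or split the input when the check fails.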

03 / Related

More to explore.

Browser Use

Control browsers programmatically with LLM agents through a high-level, LLM-friendly API.

Cavegemma

JuliusBrussee

An experimental LoRA fine-tune of Gemma to speak caveman-mode natively.

caveman-code

JuliusBrussee

A TypeScript implementation of the caveman compression engine.

claude-context-optimizer

egorfedorov

Tracks token usage, blocks redundant reads, and supports .contextignore and budget alerts.

claude-rolling-context

NodeNestor

A proxy plugin that rolls context compression past 100K tokens.

claude-token-optimizer

nadimtuhin

Restructures CLAUDE.md and docs for roughly 90% token savings, with a CLI to audit and compress.

04 / Build

Build with Phi-4 Multimodal.

Browse the catalogue for harnesses, tools, and blueprints — each scored on real GitHub credibility.
