Model Wars · March 4, 2026 · via Microsoft Research
Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model
Why it matters
Microsoft's release of a 15B open-weight multimodal reasoning model signals a major shift toward reasoning-capable vision models at accessible scale, challenging the assumption that advanced multimodal reasoning requires massive closed models.
Key signals
- Phi-4-reasoning-vision-15B: 15 billion parameters
- Multimodal reasoning model (vision + language)
- Open-weight distribution via Microsoft Foundry, HuggingFace, GitHub
- Positioned for image captioning and vision-language reasoning tasks
- Published March 4, 2026 by Microsoft Research
The hook
Microsoft just shipped Phi-4-reasoning-vision-15B, a 15B multimodal reasoning model that rewrites what's possible at that scale.
We are pleased to announce Phi-4-reasoning-vision-15B, a 15 billion parameter open-weight multimodal reasoning model, available through Microsoft Foundry, HuggingFace, and GitHub. Phi-4-reasoning-vision-15B is a broadly capable model that can be used for a wide array of vision-language tasks such as image captioning, asking […]
The post Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model appeared first on Microsoft Research.
Relevance score: 78/100