Model Wars · March 4, 2026 · via Microsoft Research
Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model
Why it matters
Microsoft's release of a 15B open-weight multimodal reasoning model signals a major shift toward reasoning-capable vision models at accessible scale, challenging the assumption that advanced multimodal reasoning requires massive closed models.
Key signals
- Phi-4-reasoning-vision-15B: 15 billion parameters
- Multimodal reasoning model (vision + language)
- Open-weight distribution via Microsoft Foundry, HuggingFace, GitHub
- Positioned for image captioning and vision-language reasoning tasks
- Published March 4, 2026 by Microsoft Research
The hook
Microsoft just shipped Phi-4-reasoning-vision-15B, a 15B multimodal reasoning model that rewrites what's possible at that scale.
We are pleased to announce Phi-4-reasoning-vision-15B, a 15 billion parameter open-weight multimodal reasoning model, available through Microsoft Foundry, HuggingFace, and GitHub. Phi-4-reasoning-vision-15B is a broadly capable model that can be used for a wide array of vision-language tasks such as image captioning, asking […]
The post Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model appeared first on Microsoft Research.
Relevance score: 78/100