Microsoft releases Phi-4-reasoning-vision-15B multimodal model with training insights

Model overview and availability
Phi-4-reasoning-vision-15B is a 15-billion-parameter open-weight multimodal reasoning model, available through Microsoft Foundry, Hugging Face, and GitHub. It is designed as a compact model that balances reasoning power, efficiency, and training data requirements.
Capabilities and performance
The model handles a wide range of vision-language tasks, including image captioning, answering questions about images, reading documents and receipts, helping with homework, and inferring changes across sequences of images. It particularly excels at math and science reasoning and at understanding and grounding elements on computer and mobile screens.
Benchmark results are competitive with much slower models that require ten times or more compute time and tokens, and accuracy is better than that of similarly fast models on math and science reasoning. Benchmarks used include ChartQA_TEST, MathVista_MINI, MMMU_VAL, and ScreenSpot_v2.
Training approach and efficiency
The model was trained on just 200 billion tokens of multimodal data, building on Phi-4-reasoning (trained on 16 billion tokens), which is itself based on Phi-4 (400 billion unique tokens). By comparison, other multimodal models such as Qwen 2.5 VL, Qwen 3 VL, Kimi-VL, and Gemma3 were trained on more than 1 trillion tokens.
Microsoft cites careful architecture choices, rigorous data curation, and a mixture of reasoning and non-reasoning data as the key lessons from training this model. The approach aims to push the Pareto frontier of the tradeoff between accuracy and compute cost.
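To put the token counts quoted above side by side, here is a minimal arithmetic sketch. The figures are taken from the text; treating "more than 1 trillion" as a floor of exactly 1,000 billion is an assumption, and so is summing the three training stages into one lineage total (the source does not report a combined figure, and the stages use different kinds of data).

```python
# Token counts in billions, as quoted in the article.
MULTIMODAL_TOKENS = 200      # Phi-4-reasoning-vision-15B multimodal training data
REASONING_TOKENS = 16        # Phi-4-reasoning
BASE_TOKENS = 400            # Phi-4 (unique tokens)

# Assumption: "more than 1 trillion" is treated as a lower bound of 1,000 billion.
OTHER_MODELS_FLOOR = 1000


def ratio_vs_floor(tokens_billions: float) -> float:
    """Return how many times larger the 1T floor is than the given budget."""
    return OTHER_MODELS_FLOOR / tokens_billions


# Multimodal-stage comparison, which is the one the article makes.
multimodal_ratio = ratio_vs_floor(MULTIMODAL_TOKENS)

# Assumption: a naive sum across the whole lineage, for illustration only.
lineage_total = MULTIMODAL_TOKENS + REASONING_TOKENS + BASE_TOKENS
lineage_ratio = ratio_vs_floor(lineage_total)

print(f"Multimodal stage: at least {multimodal_ratio:.1f}x fewer tokens")
print(f"Lineage total: {lineage_total}B tokens, at least {lineage_ratio:.2f}x fewer")
```

Because the comparison figure is only a lower bound, both ratios are lower bounds as well.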
Target use cases
The model is intended for resource-constrained or interactive settings where smaller, faster vision-language models are needed. It's lightweight enough to run on modest hardware while maintaining structured reasoning capabilities.
📖 Read the full source: HN AI Agents