DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning
Xinrun Xu, Pi Bu, Ye Wang, Börje F. Karlsson, Ziming Wang, Tengtao Song, Qi Zhu, Jun Song, Zhiming Ding, Bo Zheng

TL;DR
DeepPHY is a new benchmark framework that assesses vision-language models' understanding of physical principles through simulated environments, revealing their struggles with precise control despite strong perceptual abilities.
Contribution
We introduce DeepPHY, a comprehensive benchmark for evaluating physical reasoning in VLMs using diverse simulated environments and detailed metrics.
Findings
State-of-the-art VLMs have difficulty translating physical knowledge into control.
DeepPHY provides a systematic way to evaluate physical reasoning.
Models show limited performance in complex physical tasks.
Abstract
Although Vision Language Models (VLMs) exhibit strong perceptual abilities and impressive visual reasoning, they struggle with attention to detail and precise action planning in complex, dynamic environments, leading to subpar performance. Real-world tasks typically require complex interactions, advanced spatial reasoning, long-term planning, and continuous strategy refinement, which usually demands an understanding of the physical rules of the target scenario. However, evaluating these capabilities in real-world scenarios is often prohibitively expensive. To bridge this gap, we introduce DeepPHY, a novel benchmark framework designed to systematically evaluate VLMs' understanding of and reasoning about fundamental physical principles through a series of challenging simulated environments. DeepPHY integrates multiple physical reasoning environments of varying difficulty levels and incorporates…
