DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning

Xinrun Xu; Pi Bu; Ye Wang; Börje F. Karlsson; Ziming Wang; Tengtao Song; Qi Zhu; Jun Song; Zhiming Ding; Bo Zheng

arXiv:2508.05405·cs.AI·August 8, 2025

TL;DR

DeepPHY is a new benchmark framework that assesses vision-language models' understanding of physical principles through simulated environments, revealing their struggles with precise control despite strong perceptual abilities.

Contribution

We introduce DeepPHY, a comprehensive benchmark for evaluating physical reasoning in VLMs using diverse simulated environments and detailed metrics.

Findings

01 State-of-the-art VLMs have difficulty translating physical knowledge into control.

02 DeepPHY provides a systematic way to evaluate physical reasoning.

03 Models show limited performance in complex physical tasks.

Abstract

Although Vision Language Models (VLMs) exhibit strong perceptual abilities and impressive visual reasoning, they struggle with attention to detail and precise action planning in complex, dynamic environments, leading to subpar performance. Real-world tasks typically require complex interactions, advanced spatial reasoning, long-term planning, and continuous strategy refinement, usually necessitating understanding the physics rules of the target scenario. However, evaluating these capabilities in real-world scenarios is often prohibitively expensive. To bridge this gap, we introduce DeepPHY, a novel benchmark framework designed to systematically evaluate VLMs' understanding and reasoning about fundamental physical principles through a series of challenging simulated environments. DeepPHY integrates multiple physical reasoning environments of varying difficulty levels and incorporates…
