PhyX: Does Your Model Have the "Wits" for Physical Reasoning?

Hui Shen; Taiqiang Wu; Qi Han; Yunta Hsieh; Jizhou Wang; Yuyue Zhang; Yuxin Cheng; Zijian Hao; Yuansheng Ni; Xin Wang; Zhongwei Wan; Kai Zhang; Wendong Xu; Jing Xiong; Ping Luo; Wenhu Chen; Chaofan Tao; Zhuoqing Mao; Ngai Wong

arXiv:2505.15929·cs.AI·May 30, 2025


Open Access · 3 Repos · 2 Datasets · 3 Reviews

TL;DR

PhyX is a large-scale benchmark that evaluates models' physics-based reasoning in visual scenarios; it reveals significant gaps in current AI models' understanding of physical principles.

Contribution

The paper introduces PhyX, the first comprehensive benchmark for assessing physics-grounded reasoning in multimodal visual questions, highlighting current models' limitations.

Findings

01

State-of-the-art models (GPT-4o, Claude3.7-Sonnet, GPT-o4-mini) achieve only 32.5% to 45.8% accuracy.

02

Models rely heavily on memorized knowledge and surface patterns.

03

Current models show significant gaps compared to human performance.

Abstract

Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and waves & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy respectively, with performance gaps exceeding 29% compared to human experts. Our analysis exposes critical…
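The figures in the abstract can be cross-checked with a little arithmetic. The sketch below uses only the three accuracies quoted above. It assumes, without confirmation from the paper, that the "gaps exceeding 29%" are percentage points measured against the best of the three listed models. The function names are hypothetical and chosen for illustration.

```python
# Accuracies (%) exactly as quoted in the abstract.
reported_accuracy = {
    "GPT-4o": 32.5,
    "Claude3.7-Sonnet": 42.2,
    "GPT-o4-mini": 45.8,
}

# Assumption: the stated gap is in percentage points, not a relative difference.
MIN_GAP_POINTS = 29.0


def implied_human_floor(accuracies, min_gap):
    """Lower bound on human-expert accuracy implied by the stated minimum gap.

    Assumes the gap is measured against the best model in `accuracies`.
    """
    best_model_accuracy = max(accuracies.values())
    return round(best_model_accuracy + min_gap, 1)


def model_spread(accuracies):
    """Difference in percentage points between the best and worst listed model."""
    values = accuracies.values()
    return round(max(values) - min(values), 1)


if __name__ == "__main__":
    print("Implied human-expert floor (%):",
          implied_human_floor(reported_accuracy, MIN_GAP_POINTS))
    print("Spread across listed models (points):",
          model_spread(reported_accuracy))
```

Under these assumptions, the human-expert accuracy would sit above the best model's score plus 29 points; the exact human figure is not given in the text above.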

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The proposed physical reasoning tasks are indeed important and crucial to model intelligence.
2. The construction process of the benchmark is detailed and reasonable.
3. Experiments are conducted comprehensively, which leads to several findings.

Weaknesses

1. From my understanding, the authors try to equate “physical reasoning” with the ability to solve challenging physics problems, suggesting that a model performing well on such tasks demonstrates strong reasoning capabilities. However, if a model performs well on gravity-related problems, can it generalize the same underlying principles to buoyancy, which essentially involves the same concept but in the opposite force direction? This kind of "generalization" or "learning the core idea" are also

Reviewer 02 · Rating 2 · Confidence 5

Strengths

- Important problem focus. Physical reasoning that integrates perception, symbolic manipulation, and real-world constraints is a valuable target for evaluation.
- Breadth of coverage across six physics domains and six reasoning categories with both MC and OE formats.
- Three input variants to study redundancy and text dependence.
- Integration with common eval toolkits and release plan for one-click evaluation.
- Evaluation includes both MLLMs and text-only LLMs through captions, which enables c

Weaknesses

- Novelty claim appears overstated. Several 2025 benchmarks already target physics reasoning with images, for example PhysReason, UGPhysics, SeePhys, PhysUniBench, and others. PhyX is larger and uses some forms of de-redundancy, but the claim of first large-scale benchmark is not well supported.
- Human baseline is too small and too weak (Table 2 is basically empty for the human baselines). Only 15 students answered 18 questions each, with no per-question overlap, no variance estimates, and seem

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The dataset collected covers a wide range of physics problems.
- The text de-redundancy step is useful in isolating the multimodal information and avoiding information compensation from the text description.
- The data collection process includes several human verification steps to maintain the quality.

Weaknesses

- The expert background is not well introduced. For example, it is unclear whether the experts have different strengths across the subdomains in the dataset and whether this is considered in the evaluation.
- The findings and analysis do not provide enough insight to sharpen the understanding of the boundary of existing LLMs.

Taxonomy

Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Machine Learning in Materials Science