PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
Hui Shen, Taiqiang Wu, Qi Han, Yunta Hsieh, Jizhou Wang, Yuyue Zhang, Yuxin Cheng, Zijian Hao, Yuansheng Ni, Xin Wang, Zhongwei Wan, Kai Zhang, Wendong Xu, Jing Xiong, Ping Luo, Wenhu Chen, Chaofan Tao, Zhuoqing Mao, Ngai Wong

TL;DR
PhyX is a large-scale benchmark designed to evaluate models' ability for physics-based reasoning in visual scenarios, revealing significant gaps in current AI models' understanding of physical principles.
Contribution
The paper introduces PhyX, the first comprehensive benchmark for assessing physics-grounded reasoning in multimodal visual questions, highlighting current models' limitations.
Findings
State-of-the-art models achieve only about 32.5% to 45.8% accuracy.
Models rely heavily on memorized knowledge and surface patterns.
Current models show significant gaps compared to human performance.
Abstract
Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy respectively, with performance gaps exceeding 29% compared to human experts. Our analysis exposes critical…
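As a rough reading of the numbers in the abstract, the sketch below computes the human-expert accuracy they imply. It assumes the "29%" gap is measured in percentage points and applies to every listed model; the abstract states neither explicitly. The names `reported_accuracy`, `MIN_GAP_POINTS`, and `implied_human_floor` are illustrative and do not come from the paper.

```python
# Accuracies reported in the abstract, in percent.
reported_accuracy = {
    "GPT-4o": 32.5,
    "Claude3.7-Sonnet": 42.2,
    "GPT-o4-mini": 45.8,
}

# Assumption: "gaps exceeding 29%" is read as percentage points.
MIN_GAP_POINTS = 29.0


def implied_human_floor(accuracies, min_gap):
    """Lower bound on human accuracy if every model trails by more than min_gap points.

    The best-performing model gives the binding constraint.
    """
    return max(accuracies.values()) + min_gap


floor = implied_human_floor(reported_accuracy, MIN_GAP_POINTS)
print(f"Implied human-expert accuracy: above {floor:.1f}%")
```

Under a different reading of the gap (for example, a relative percentage), the implied bound would change, so this is only a consistency check on the quoted figures.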
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The proposed physical reasoning tasks are indeed important and crucial to model intelligence.
2. The construction process of the benchmark is detailed and reasonable.
3. Experiments are conducted comprehensively, which leads to several findings.
1. From my understanding, the authors try to equate “physical reasoning” with the ability to solve challenging physics problems, suggesting that a model performing well on such tasks demonstrates strong reasoning capabilities. However, if a model performs well on gravity-related problems, can it generalize the same underlying principles to buoyancy, which essentially involves the same concept but in the opposite force direction? This kind of "generalization" or "learning the core idea" are also
- Important problem focus. Physical reasoning that integrates perception, symbolic manipulation, and real-world constraints is a valuable target for evaluation.
- Breadth of coverage across six physics domains and six reasoning categories with both MC and OE formats.
- Three input variants to study redundancy and text dependence.
- Integration with common eval toolkits and release plan for one-click evaluation.
- Evaluation includes both MLLMs and text-only LLMs through captions, which enables c
- Novelty claim appears overstated. Several 2025 benchmarks already target physics reasoning with images, for example PhysReason, UGPhysics, SeePhys, PhysUniBench, and others. PhyX is larger and uses some forms of de-redundancy, but the claim of being the first large-scale benchmark is not well supported.
- Human baseline is too small and too weak (Table 2 is basically empty for the human baselines). Only 15 students answered 18 questions each, with no per-question overlap, no variance estimates, and seem
- The collected dataset covers a wide range of physics problems.
- The text de-redundancy step is useful for isolating the multimodal information and avoiding information compensation from the text description.
- The data collection process includes several human verification steps to maintain quality.
- The experts' background is not well introduced. For example, it is unclear whether they have different strengths across the sub-domains in the dataset and whether this is considered in the evaluation.
- The findings and analysis do not provide enough insight to sharpen the understanding of the boundaries of existing LLMs.
Taxonomy
Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Machine Learning in Materials Science
