Phy-Q as a measure for physical reasoning intelligence
Cheng Xue, Vimukthini Pinto, Chathura Gamage, Ekaterina Nikonova, Peng Zhang, Jochen Renz

TL;DR
This paper introduces Phy-Q, a new benchmark for measuring physical reasoning intelligence in AI agents through diverse scenarios, revealing current agents' significant performance gap compared to humans.
Contribution
The paper presents a novel testbed with physical scenarios and a scoring metric, enabling evaluation of physical reasoning and generalization in AI agents.
Findings
All tested agents perform below human levels.
Learning agents struggle with physical rule generalization.
Current agents show limited physical reasoning capabilities.
Abstract
Humans are well-versed in reasoning about the behaviors of physical objects and choosing actions accordingly to accomplish tasks, while it remains a major challenge for AI. To facilitate research addressing this problem, we propose a new testbed that requires an agent to reason about physical scenarios and take an action appropriately. Inspired by the physical knowledge acquired in infancy and the capabilities required for robots to operate in real-world environments, we identify 15 essential physical scenarios. We create a wide variety of distinct task templates, and we ensure all the task templates within the same scenario can be solved by using one specific strategic physical rule. By having such a design, we evaluate two distinct levels of generalization, namely the local generalization and the broad generalization. We conduct an extensive evaluation with human players, learning…
Taxonomy
Topics: AI-based Problem Solving and Planning
