Probing Mechanical Reasoning in Large Vision Language Models
Haoran Sun, Qingying Gao, Haiyun Lyu, Dezhi Luo, Yijiang Li, Hokin Deng

TL;DR
This paper evaluates the mechanical reasoning abilities of 26 Vision Language Models across various physics domains, revealing significant gaps compared to human performance and highlighting limitations in current architectures.
Contribution
It introduces a comprehensive benchmark of mechanical reasoning tasks for VLMs and uncovers their persistent shortcomings, especially in complex physics reasoning.
Findings
VLMs perform worse than humans across all tested domains.
Performance on gear-system and fluid-mechanics tasks does not improve as the number of parameters increases.
Current architectures struggle with mental simulation tasks in physics.
Abstract
Mechanical reasoning is a hallmark of human intelligence, defined by its ubiquitous yet irreplaceable role in human activities ranging from routine tasks to civil engineering. Embedding machines with mechanical reasoning is therefore an important step towards building human-level artificial intelligence. Here, we leveraged 155 cognitive experiments to test the understanding of system stability, gears and pulley systems, the leverage principle, inertia and motion, and fluid mechanics in 26 Vision Language Models (VLMs). Results indicate that VLMs consistently perform worse than humans on all domains and demonstrate significant difficulty in reasoning about gear systems and fluid mechanics. Notably, their performance on these tasks does not improve as the number of parameters increases, suggesting that current attention-based architectures may fail to grasp certain underlying mechanisms required…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
