PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding
Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, Yue Wang

TL;DR
PhysBench is a comprehensive benchmark designed to evaluate and improve vision-language models' understanding of physical phenomena, addressing a key gap in embodied AI capabilities.
Contribution
The paper introduces PhysBench, a large-scale benchmark for physical understanding, and PhysAgent, a framework that enhances VLMs' physical reasoning abilities.
Findings
VLMs excel in common-sense reasoning but struggle with physical understanding.
PhysAgent significantly improves VLMs' performance on physical tasks.
Enhancing physical understanding in VLMs benefits embodied AI applications.
Abstract
Understanding the physical world is a fundamental challenge in embodied AI, critical for enabling agents to perform complex tasks and operate safely in real-world environments. While Vision-Language Models (VLMs) have shown great promise in reasoning and task planning for embodied agents, their ability to comprehend physical phenomena remains extremely limited. To close this gap, we introduce PhysBench, a comprehensive benchmark designed to evaluate VLMs' physical world understanding capability across a diverse set of tasks. PhysBench contains 10,002 entries of interleaved video-image-text data, categorized into four major domains: physical object properties, physical object relationships, physical scene understanding, and physics-based dynamics, further divided into 19 subclasses and 8 distinct capability dimensions. Our extensive experiments, conducted on 75 representative VLMs,…
Taxonomy
Topics: Semantic Web and Ontologies · Robotics and Automated Systems
Methods: Sparse Evolutionary Training
