PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding

Wei Chow; Jiageng Mao; Boyi Li; Daniel Seita; Vitor Guizilini; Yue Wang

arXiv:2501.16411 · cs.CV · January 30, 2025 · 2 cites

TL;DR

PhysBench is a comprehensive benchmark designed to evaluate and improve vision-language models' understanding of physical phenomena, addressing a key gap in embodied AI capabilities.

Contribution

The paper introduces PhysBench, a large-scale benchmark for physical understanding, and PhysAgent, a framework that enhances VLMs' physical reasoning abilities.

Findings

01. VLMs excel in common-sense reasoning but struggle with physical understanding.

02. PhysAgent significantly improves VLMs' performance on physical tasks.

03. Enhancing physical understanding in VLMs benefits embodied AI applications.

Abstract

Understanding the physical world is a fundamental challenge in embodied AI, critical for enabling agents to perform complex tasks and operate safely in real-world environments. While Vision-Language Models (VLMs) have shown great promise in reasoning and task planning for embodied agents, their ability to comprehend physical phenomena remains extremely limited. To close this gap, we introduce PhysBench, a comprehensive benchmark designed to evaluate VLMs' physical world understanding capability across a diverse set of tasks. PhysBench contains 10,002 entries of interleaved video-image-text data, categorized into four major domains: physical object properties, physical object relationships, physical scene understanding, and physics-based dynamics, further divided into 19 subclasses and 8 distinct capability dimensions. Our extensive experiments, conducted on 75 representative VLMs,…
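To make the benchmark's domain structure concrete, the following is a minimal sketch of how per-domain accuracy could be tallied for a model's answers. The four domain names come from the abstract; the record layout (`domain`, `answer`, `prediction` keys), the function name `per_domain_accuracy`, and the toy records are all hypothetical and are not taken from the PhysBench release or its evaluation code.

```python
from collections import defaultdict

# Domain names as listed in the abstract. Everything else below is an
# illustrative assumption, not the PhysBench data schema.
DOMAINS = (
    "physical object properties",
    "physical object relationships",
    "physical scene understanding",
    "physics-based dynamics",
)


def per_domain_accuracy(records):
    """Return {domain: fraction of correct predictions}.

    Each record is a dict with hypothetical keys 'domain', 'answer',
    and 'prediction'. Domains with no records are omitted.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        domain = record["domain"]
        if domain not in DOMAINS:
            raise ValueError(f"unknown domain: {domain!r}")
        total[domain] += 1
        # A prediction counts as correct only on an exact match.
        correct[domain] += int(record["prediction"] == record["answer"])
    return {domain: correct[domain] / total[domain] for domain in total}


# Toy records for illustration only; these are not benchmark results.
toy_records = [
    {"domain": "physical object properties", "answer": "A", "prediction": "A"},
    {"domain": "physical object properties", "answer": "B", "prediction": "C"},
    {"domain": "physics-based dynamics", "answer": "D", "prediction": "D"},
]

toy_scores = per_domain_accuracy(toy_records)
```

Reporting one score per domain rather than a single aggregate keeps a weakness in, say, physics-based dynamics from being masked by strength elsewhere.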

Code & Models

Datasets

USC-PSI-Lab/PhysBench
dataset · 659 dl

Videos

PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding · slideslive

Taxonomy

Topics: Semantic Web and Ontologies · Robotics and Automated Systems

Methods: Sparse Evolutionary Training