PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly
Liang Ma, Jiajun Wen, Min Lin, Rongtao Xu, Xiwen Liang, Bingqian Lin, Jun Ma, Yongxin Wang, Ziming Wei, Haokun Lin, Mingfei Han, Meng Cao, Bokui Chen, Ivan Laptev, Xiaodan Liang

TL;DR
PhyBlock is a comprehensive benchmark designed to evaluate vision-language models on their physical understanding and planning abilities in 3D block assembly tasks, revealing significant limitations in high-level reasoning and spatial comprehension.
Contribution
The paper introduces PhyBlock, a novel progressive benchmark with a hierarchical task structure and VQA components to assess physical reasoning in VLMs, filling a critical evaluation gap.
Findings
VLMs show limited high-level planning capabilities.
Performance declines with task complexity.
Chain-of-thought prompting has minimal impact.
Abstract
While vision-language models (VLMs) have demonstrated promising capabilities in reasoning and planning for embodied agents, their ability to comprehend physical phenomena, particularly within structured 3D environments, remains severely limited. To close this gap, we introduce PhyBlock, a progressive benchmark designed to assess VLMs on physical understanding and planning through robotic 3D block assembly tasks. PhyBlock integrates a novel four-level cognitive hierarchy assembly task alongside targeted Visual Question Answering (VQA) samples, collectively aimed at evaluating progressive spatial reasoning and fundamental physical comprehension, including object properties, spatial relationships, and holistic scene understanding. PhyBlock includes 2600 block tasks (400 assembly tasks, 2200 VQA tasks) and evaluates models across three key dimensions: partial completion, failure diagnosis,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
