PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?
Atharva Gundawar, Som Sagar, Ransalu Senanayake

TL;DR
PAC Bench is a comprehensive benchmark that evaluates vision-language models on their understanding of physical properties, affordances, and constraints crucial for reliable robot manipulation, revealing significant gaps in current models.
Contribution
The paper introduces PAC Bench, a large-scale benchmark dataset and evaluation framework for assessing VLMs' understanding of physical prerequisites in manipulation tasks.
Findings
Current VLMs show limited understanding of physical properties and constraints.
PAC Bench highlights key areas where VLMs need improvement for robotic applications.
The benchmark facilitates targeted research toward more physically grounded models.
Abstract
Vision-Language Models (VLMs) are increasingly pivotal for generalist robot manipulation, enabling tasks such as physical reasoning, policy generation, and failure detection. However, their proficiency in these high-level applications often assumes a deep understanding of low-level physical prerequisites, a capability that remains largely unverified. For robots to perform actions reliably, they must comprehend intrinsic object properties (e.g., material, weight), action affordances (e.g., graspable, stackable), and physical constraints (e.g., stability, reachability, or an object's state, such as being closed). Despite the widespread use of VLMs in manipulation tasks, we argue that off-the-shelf models may lack this granular, physically grounded understanding, as such prerequisites are often overlooked during training. To address this critical gap, we introduce PAC Bench, a…
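To make the three prerequisite categories named in the abstract concrete, here is a minimal sketch of how answers to such queries could be scored per category. It is not the authors' evaluation code: the `Query` class, the `per_category_accuracy` function, the exact-match scoring rule, and every example question and answer are hypothetical illustrations and are not taken from PAC Bench.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three prerequisite categories named in the abstract.
CATEGORIES = ("property", "affordance", "constraint")


@dataclass(frozen=True)
class Query:
    """A hypothetical benchmark item (not the PAC Bench schema)."""
    category: str   # one of CATEGORIES
    question: str   # question posed to the model about an image
    expected: str   # ground-truth answer


def per_category_accuracy(queries, answers):
    """Return exact-match accuracy for each category that appears.

    Exact matching after lowercasing is an assumed scoring rule,
    chosen only to keep the sketch short.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for query, answer in zip(queries, answers):
        if query.category not in CATEGORIES:
            raise ValueError(f"unknown category: {query.category}")
        total[query.category] += 1
        if answer.strip().lower() == query.expected.strip().lower():
            correct[query.category] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


# Made-up items and made-up model answers, for illustration only.
queries = [
    Query("property", "What material is the mug made of?", "ceramic"),
    Query("property", "Is the box heavy or light?", "light"),
    Query("affordance", "Is the mug graspable?", "yes"),
    Query("constraint", "Is the drawer closed?", "yes"),
]
answers = ["Ceramic", "heavy", "yes", "no"]

scores = per_category_accuracy(queries, answers)
print(scores)
```

Reporting accuracy separately for each category, rather than as one pooled number, shows which kind of prerequisite a model handles worst.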
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
