TL;DR
SIMPACT augments vision-language models with physics simulation at test time, improving physical reasoning and robotic manipulation without additional training.
Contribution
The paper introduces a simulation-in-the-loop framework that equips VLMs with physical reasoning at test time, improving robotic manipulation performance.
Findings
Achieves state-of-the-art results on five manipulation tasks.
Effectively models contact dynamics and action outcomes.
Operates without additional training, using only a single RGB-D observation.
Abstract
Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities. However, they lack a grounded understanding of physical dynamics. This limitation arises from training VLMs on static internet-scale visual-language data that contain no causal interactions or action-conditioned changes. Consequently, it remains challenging to leverage VLMs for fine-grained robotic manipulation tasks that require physical understanding, reasoning, and corresponding action planning. To overcome this, we present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework that equips VLMs with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training. From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and…
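The propose-and-simulate loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `simulate_push` is a hypothetical one-dimensional stand-in for the physics simulator, `propose_action` is a hypothetical stand-in for the VLM, and the feedback step (revising the next proposal from the last rollout) is an assumption, since the abstract is cut off before describing what follows the rollouts.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rollout:
    action: float     # commanded push distance (m)
    final_pos: float  # simulated object position after the push (m)
    error: float      # signed distance still remaining to the goal (m)


def simulate_push(start: float, push: float, friction_loss: float = 0.05) -> float:
    """Hypothetical toy simulator: a fixed part of each push is lost to friction."""
    return start + max(0.0, push - friction_loss)


def propose_action(goal: float, start: float, last: Optional[Rollout]) -> float:
    """Hypothetical stand-in for the VLM proposer.

    The first guess ignores friction; later guesses add the error seen
    in the previous simulated rollout (assumed feedback rule).
    """
    if last is None:
        return goal - start
    return last.action + last.error


def plan(
    start: float,
    goal: float,
    simulate: Callable[[float, float], float] = simulate_push,
    propose: Callable[[float, float, Optional[Rollout]], float] = propose_action,
    max_iters: int = 5,
    tol: float = 1e-3,
) -> Tuple[Rollout, List[Rollout]]:
    """Propose an action, roll it out in simulation, and revise until within tol."""
    history: List[Rollout] = []
    last: Optional[Rollout] = None
    for _ in range(max_iters):
        action = propose(goal, start, last)
        final_pos = simulate(start, action)
        last = Rollout(action, final_pos, goal - final_pos)
        history.append(last)
        if abs(last.error) <= tol:
            break
    return last, history


if __name__ == "__main__":
    best, history = plan(start=0.0, goal=0.3)
    for step in history:
        print(step)
```

With these toy defaults, the first proposal undershoots because the naive guess ignores the friction loss, and the second proposal corrects for it using the simulated error. No real robot action is taken until the loop ends, which is the point of keeping the simulator in the loop.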
