SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models

Haowen Liu; Shaoxiong Yao; Haonan Chen; Jiawei Gao; Jiayuan Mao; Jia-Bin Huang; Yilun Du

arXiv:2512.05955 · cs.RO · April 1, 2026

TL;DR

SIMPACT enhances vision-language models with physics simulation at test time, enabling better physical reasoning and manipulation in robotics without additional training.

Contribution

The paper introduces a simulation-in-the-loop framework that equips VLMs with physical reasoning capabilities at test time, improving robotic manipulation performance.

Findings

01. Achieves state-of-the-art results on five manipulation tasks.
02. Effectively models contact dynamics and action outcomes.
03. Operates without additional training, using only a single RGB-D observation.

Abstract

Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities. However, they lack a grounded understanding of physical dynamics. This limitation arises from training VLMs on static internet-scale visual-language data that contain no causal interactions or action-conditioned changes. Consequently, it remains challenging to leverage VLMs for fine-grained robotic manipulation tasks that require physical understanding, reasoning, and corresponding action planning. To overcome this, we present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework that equips VLMs with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training. From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and…
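
The propose-and-simulate loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_proposer` stands in for the VLM, `toy_simulate` stands in for the physics simulation, and the 1-D push task, the friction factor, and the error-feedback rule are all assumptions chosen only to make the loop runnable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    action: float     # hypothetical 1-D push distance commanded to the robot
    final_pos: float  # simulated object position after the push


def toy_simulate(start: float, push: float, friction: float = 0.8) -> Rollout:
    """Stand-in for a physics simulator: the object travels only a fraction of the push."""
    return Rollout(action=push, final_pos=start + friction * push)


def toy_proposer(goal: float, start: float, history: List[Rollout]) -> float:
    """Stand-in for a VLM: guess naively first, then correct using the last rollout's error."""
    if not history:
        return goal - start
    last = history[-1]
    return last.action + (goal - last.final_pos)


def plan(
    start: float,
    goal: float,
    propose: Callable[[float, float, List[Rollout]], float],
    simulate: Callable[[float, float], Rollout],
    max_iters: int = 10,
    tol: float = 1e-2,
) -> Rollout:
    """Propose an action, roll it out in simulation, and keep the best candidate so far."""
    history: List[Rollout] = []
    best = None
    for _ in range(max_iters):
        action = propose(goal, start, history)
        rollout = simulate(start, action)
        history.append(rollout)
        if best is None or abs(goal - rollout.final_pos) < abs(goal - best.final_pos):
            best = rollout
        if abs(goal - best.final_pos) <= tol:
            break  # simulated outcome is close enough to the goal
    return best


if __name__ == "__main__":
    best = plan(start=0.0, goal=1.0, propose=toy_proposer, simulate=toy_simulate)
    print(f"chosen push: {best.action:.3f}, simulated final position: {best.final_pos:.3f}")
```

In this toy setting the naive first guess undershoots because of friction, and each simulated rollout gives the proposer the information it needs to correct the next action. No model is trained at any point, which mirrors the test-time framing of the paper, although the real system works from an RGB-D observation rather than a scalar state.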

Code & Models

Repositories

https://simpact-bot.github.io