Prompting with the Future: Open-World Model Predictive Control with Interactive Digital Twins
Chuanruo Ning, Kuan Fang, Wei-Chiu Ma

TL;DR
This paper introduces a model predictive control framework that combines vision-language models with interactive digital twins to improve low-level robot control in open-world manipulation tasks, leveraging scene simulation and future observation prompting.
Contribution
It presents a novel integration of VLMs with physically-grounded digital twins for enhanced robotic control, enabling better scene understanding and trajectory planning.
Findings
Outperforms baseline methods in complex manipulation tasks
Enhances scene understanding through digital twin rendering
Improves low-level control accuracy in open-world scenarios
Abstract
Recent advancements in open-world robot manipulation have been largely driven by vision-language models (VLMs). While these models exhibit strong generalization ability in high-level planning, they struggle to predict low-level robot controls due to limited physical-world understanding. To address this issue, we propose a model predictive control framework for open-world manipulation that combines the semantic reasoning capabilities of VLMs with physically-grounded, interactive digital twins of the real-world environments. By constructing and simulating the digital twins, our approach generates feasible motion trajectories, simulates corresponding outcomes, and prompts the VLM with future observations to evaluate and select the most suitable outcome based on language instructions of the task. To further enhance the capability of pre-trained VLMs in understanding complex scenes for…
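The loop the abstract describes (propose candidate motions, simulate each one, score the predicted outcome, pick the best) can be illustrated with a minimal sampling-based MPC sketch. Everything here is a hypothetical stand-in and not the authors' implementation: `simulate` replaces the interactive digital twin with a toy 2-D point model, `make_goal_scorer` replaces the VLM's judgment of rendered future observations with a distance-to-goal score, and executing only the first action of the chosen sequence is the usual receding-horizon convention, assumed here, not something the abstract states.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

State = Tuple[float, float]   # toy 2-D object position (assumption)
Action = Tuple[float, float]  # toy 2-D displacement (assumption)


def simulate(state: State, actions: Sequence[Action]) -> State:
    """Toy stand-in for the digital twin: apply displacements in order."""
    x, y = state
    for dx, dy in actions:
        x, y = x + dx, y + dy
    return (x, y)


def sample_action_sequences(
    n: int, horizon: int, rng: random.Random
) -> List[List[Action]]:
    """Draw n random candidate action sequences of the given horizon."""
    return [
        [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(horizon)]
        for _ in range(n)
    ]


def make_goal_scorer(goal: State) -> Callable[[State], float]:
    """Hypothetical stand-in for the VLM: higher score means closer to goal."""
    def score(outcome: State) -> float:
        return -((outcome[0] - goal[0]) ** 2 + (outcome[1] - goal[1]) ** 2)
    return score


def mpc_step(
    state: State,
    score_outcome: Callable[[State], float],
    n_candidates: int = 64,
    horizon: int = 3,
    rng: Optional[random.Random] = None,
) -> Tuple[Action, State]:
    """One planning step: sample, simulate, score, select."""
    if rng is None:
        rng = random.Random(0)
    candidates = sample_action_sequences(n_candidates, horizon, rng)
    outcomes = [simulate(state, c) for c in candidates]
    best = max(range(n_candidates), key=lambda i: score_outcome(outcomes[i]))
    # Receding horizon (assumed): return only the first action of the best plan,
    # together with the outcome the simulator predicted for the full plan.
    return candidates[best][0], outcomes[best]


if __name__ == "__main__":
    scorer = make_goal_scorer((2.0, 0.0))
    action, predicted = mpc_step((0.0, 0.0), scorer, rng=random.Random(0))
    print("first action:", action, "predicted outcome:", predicted)
```

In a real system along the lines the abstract describes, the scoring function would be replaced by a query to a VLM over rendered observations of each simulated outcome; the selection logic around it would stay the same.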
Taxonomy
Topics: Advanced Control Systems Optimization
