Grounded Vision-Language Interpreter for Integrated Task and Motion Planning
Jeremy Siburian, Keisuke Shirai, Cristian C. Beltran-Hernandez, Masashi Hamaya, Michael Görner, Atsushi Hashimoto

TL;DR
This paper introduces ViLaIn-TAMP, a hybrid planning framework combining vision-language interpretation with symbolic and geometric planning, enhancing safety, interpretability, and success rates in robot task execution.
Contribution
It presents a novel hybrid framework that integrates vision-language models with symbolic and geometric planning, including a feedback loop for refinement, improving robot planning reliability.
Findings
Outperforms the baseline by 18% in success rate
Adding corrective planning boosts success rate by 32%
Effective in complex cooking manipulation tasks
Abstract
While recent advances in vision-language models have accelerated the development of language-guided robot planners, their black-box nature often lacks safety guarantees and interpretability crucial for real-world deployment. Conversely, classical symbolic planners offer rigorous safety verification but require significant expert knowledge for setup. To bridge the current gap, this paper proposes ViLaIn-TAMP, a hybrid planning framework for enabling verifiable, interpretable, and autonomous robot behaviors. ViLaIn-TAMP comprises three main components: (1) a Vision-Language Interpreter (ViLaIn) adapted from previous work that converts multimodal inputs into structured problem specifications, (2) a modular Task and Motion Planning (TAMP) system that grounds these specifications in actionable trajectory sequences through symbolic and geometric constraint reasoning, and (3) a corrective…
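The interpret-then-plan structure described above, together with the feedback loop for refinement mentioned in the Contribution summary, can be illustrated with a minimal sketch. This is not the authors' implementation: `PlanResult`, `plan_with_refinement`, `toy_interpret`, `toy_plan`, and the example facts are all hypothetical names invented here, and the toy stubs stand in for a real vision-language interpreter and a real TAMP system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlanResult:
    """Outcome of one planning attempt (hypothetical structure)."""
    trajectory: Optional[list]  # None when planning failed
    feedback: str = ""          # planner's explanation of a failure


def plan_with_refinement(interpret, plan, instruction, observation, max_attempts=3):
    """Alternate between interpretation and planning until a plan is found.

    interpret(instruction, observation, feedback) -> problem specification
    plan(spec) -> PlanResult
    Returns (trajectory or None, number of attempts used).
    """
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        spec = interpret(instruction, observation, feedback)
        result = plan(spec)
        if result.trajectory is not None:
            return result.trajectory, attempt
        # Feed the failure reason back into the next interpretation.
        feedback = result.feedback
    return None, max_attempts


def toy_interpret(instruction, observation, feedback):
    """Stand-in interpreter: omits one fact until it receives feedback."""
    init = ["(on cucumber board)"]
    if feedback:
        init.append("(available knife)")
    return {"init": init, "goal": "(sliced cucumber)"}


def toy_plan(spec):
    """Stand-in planner: fails when a required initial fact is missing."""
    if "(available knife)" not in spec["init"]:
        return PlanResult(None, "missing fact: (available knife)")
    return PlanResult(["pick knife", "slice cucumber"])


trajectory, attempts = plan_with_refinement(
    toy_interpret, toy_plan, "slice the cucumber", observation=None
)
print(trajectory, attempts)
```

In this toy run the first specification fails, the planner's message is passed back to the interpreter, and the second attempt succeeds. Capping the loop with `max_attempts` is a design choice assumed here so that an unsolvable specification cannot cause endless retries.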
